TIMING CODES ON THE CRAY-1: PRINCIPLES AND APPLICATIONS

ABSTRACT 

Complete instruction-timing information for the CRAY-1 computer is
presented together with a method of recording the minimum necessary details
for precise prediction of the running time of various algorithms. Several
examples of optimum assembly language coding are listed, with comments that
illustrate the timing details. Usage of the code CYCLES, which predicts the
timing of actual CAL, CFT, or CIVIC programs, is described. Usage of codes
TIMER and TALLY is described.

I. INTRODUCTION

The aim of this document is to show how to locate and analyze the
segments of a code that are important from a timing viewpoint. Computer
codes TIMER and TALLY are useful for this purpose. Then, having identified
critical sections, we consider how to perform them optimally. Computer code
CYCLES is of value in obtaining such performance.

On the CRAY-1, optimum programming consists of finding the best
algorithm and avoiding conflicts in implementing it. Usually the best
algorithm can be characterized as a "parallel vector" algorithm.

Once an algorithm has been decided upon, one must consider how it can be
implemented with actual hardware instructions. The algorithm may have to be
changed if it causes unavoidable conflicts due to the shared nature of the
CRAY-1's data paths, registers, functional units, and memory. Avoiding
conflicts is primarily a matter of understanding the timing details involved.

Several examples of improved performance achieved through timing
analysis will be given. (For a description of the environment at LLNL in
which your code will run, see Appendix B.)
II. OVERALL TIME ANALYSIS ON THE CRAY-1

The first step in improving the performance of a code is to find out
where it is spending its time. In most programs there is some small
part of the algorithm that uses the majority of the CPU time. Thus,
changes to a very limited number of lines of code can result in dramatic
reductions in the amount of time required to perform a calculation. In
particular, if you have a FORTRAN program in which, say, 70% of the time is
spent in an inner DO loop, you can limit your effort, initially, to making
improvements to that loop. In such cases, obviously, the use of assembly
language should be considered. Much of this report will be concerned with
the timing of relatively small assembly language routines. However,
we first look at full code analysis.



Code Timing with TIMER and TALLY



The LASNEX code group, primarily Jim Kohn and George Zimmerman, has put
together a simple set of tools to do code timing on the CRAY-1 (and 7600).
The capabilities are similar to BEGINMAP-ENDMAP but are simpler to use. The
output produced by this set of tools is much less extensive than BEGINMAP but
contains the essential ingredients to do timing analysis for almost any code.

Timer

TIMER is a subroutine which you must call in your code. The call looks
like:

      CALL TIMER (IOC, 'FNAME', BUFFER, LBUFFER, 'HEADER', LHEADER)

where,

IOC      is an I/O Connector (IOC) available for I/O. However, if this IOC
         ever becomes unavailable, TIMER tries to find another one. The
         IOC is active only during actual writes to disk by TIMER. IOC=0
         is satisfactory.

FNAME    is a file sequence name. A sequenced name is formed from this by
         appending a digit (usually 0) on the right end of the name,
         truncating the leftmost character if necessary. If FNAME already
         ends with a decimal digit, FNAME is used as is for the first file
         in the sequence. If any file in the sequence already exists it
         will be destroyed.

BUFFER   is an I/O buffer. It must be permanently available and reserved
         for TIMER's use only. Otherwise garbage could be written to disk.

LBUFFER  is the length of the I/O buffer. It may be any size convenient
         for the user. 512 words seems to work quite well.

HEADER   is an ASCII string which will be written into the beginning of the
         disk file to identify this timing file (in case multiple runs are
         made). Date, time, code name, problem name are some possible
         items that you may wish to put in the header.

LHEADER  is the word length of HEADER. It must be at least one word long,
         even if the header itself is blank.



TIMER operates by interrupting your code every 4 milliseconds and
finding out what the p-counter is. It stores the p-counter in the buffer,
dumps the buffer if necessary, and then returns from interrupt. TIMER itself
does not perform any actual timing analysis. It just creates a timing file
with p-counters in it. The actual analysis is done by the TALLY code.

To obtain a complete timing analysis of your code, TIMER should be
called as early as possible during the execution of your code. Once the call
to TIMER has been made, no other calls are required until your code wants to
terminate the timing analysis. Your code should not be affected by the
presence of TIMER in it. The overhead is approximately 5 microseconds per
interrupt, which should not be detectable. TIMER contains only about 100
lines of FORTRAN so it is very small.
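
As a rough check on the claim that the sampling overhead is not detectable, the two figures quoted above (one interrupt every 4 milliseconds, about 5 microseconds per interrupt) can be combined as follows. This is only an illustrative Python sketch; the function names are hypothetical and are not part of TIMER.

```python
# Figures taken from the text above; everything else is illustrative.
SAMPLE_INTERVAL_S = 4e-3       # one interrupt every 4 milliseconds
OVERHEAD_PER_SAMPLE_S = 5e-6   # approximately 5 microseconds per interrupt


def sampling_overhead_fraction(interval_s=SAMPLE_INTERVAL_S,
                               overhead_s=OVERHEAD_PER_SAMPLE_S):
    """Fraction of CPU time consumed by the sampling interrupts."""
    return overhead_s / interval_s


def samples_in(run_seconds, interval_s=SAMPLE_INTERVAL_S):
    """Approximate number of p-counter samples collected in a run."""
    return round(run_seconds / interval_s)


if __name__ == "__main__":
    print(f"overhead: {100 * sampling_overhead_fraction():.3f}% of CPU time")
    print(f"samples in a 4-second run: {samples_in(4.0)}")
```

Under these figures the overhead is about one eighth of one percent, and a run of a few seconds yields on the order of a thousand samples.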

To terminate the timing analysis, a call must be made to TIMEND. TIMEND
is called with no arguments. It shuts down the timing, flushes the buffer,
closes the file and truncates it. TIMEND is an entry point inside TIMER.

No externals are required by TIMER (or TIMEND). It is self-contained.
It is available by loading your code with ALIBCRAY. If you cannot access
ALIBCRAY, the source for TIMER may be extracted from file CLASS, and compiled
to produce a binary file for LDR.

TIMER stores one other piece of information in the timing file along
with the p-counter. This is a process index. This index is read from common
block /Q8LDBKX/ which is one word long. By default this word is set to 1.
Your code may set this word at any time to designate the current process
which is active. The only reason to do this would be to obtain a more
detailed breakdown of the usage of utility subroutines (e.g., SQRT, LOG, EXP,
BASELIB routines, etc.) according to the structure of your code. For
example, you could find out which logical process in your code is using SQRT
the most. This feature is usually used in overlaid (or segmented) codes
where the overlay (or segment) number can be stored into this common block.
But any single level code could use this equally well. Maximum value for
this process index is 255 on CRAY.
Tally

The TALLY code requires 2 files in order to do a timing analysis. The
first is the set of timing files (usually 1 file) produced by the TIMER
routine. The second file is the symbol table file produced by the loader.
The symbol table is usually contained in your controllee file so you may
normally use your executing code name as the symbol table file. A copy of
TALLY can be extracted from public file "NELSON" at LLNL.

The execute line to run TALLY is:

      TALLY timing-file-name symbol-table-file [options] / t v

where the following options are available:

none     (i.e., no options specified). This does a short timing analysis.
         Histograms on a subroutine by subroutine basis are not produced.

ALL      This does a complete timing analysis producing all of the output
         TALLY can. Most people use this option.

BS n     Set the Bin Size to n parcels. TALLY accumulates timing information
         into bins. Each bin represents n parcels of your code. Default is
         n = 32 (8 words) which works very nicely.

The timing analysis produced by TALLY is fairly straightforward to
understand. It is broken into 3 logical sections. Each section includes
percentage breakdowns as well as actual numbers of hits. The term "hit"
designates an instance of the p-counter being in a given routine or a given
bin.

The first section does an overall timing analysis. The number of hits
in each subprogram as well as the percent of the total time the subprogram
used is listed. A subprogram appears in this list only if at least 1 hit was
recorded within its bounds.

The second section does a similar kind of analysis but by process index.
Thus this is a bit more detailed. The usage of commonly used utility
subprograms is broken up by process index.

The third section (if requested with ALL) is a detailed analysis (via
histogram) of each subprogram for which hits were recorded. The breakdown is
by bins where a bin represents a small section of code. The number of hits
within a bin is printed along with a 'bar' indicating graphically the
relative time spent within the bin. Note that the algorithm determining the
length of the 'bar' is non-linear. The actual hit count must be used for an
accurate, detailed analysis.
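
The binning described above can be pictured with a small Python sketch. It is not TALLY's actual implementation: the function name `tally_bins`, its arguments, and the assumption that the sampled p-counters and the routine bounds are all expressed in parcels are illustrative only.

```python
from collections import Counter


def tally_bins(p_samples, start, length, bin_size=32):
    """Count hits per bin for one routine.

    p_samples : sampled p-counter values (assumed to be parcel addresses)
    start     : first parcel address of the routine (hypothetical input)
    length    : routine length in parcels (hypothetical input)
    bin_size  : parcels per bin; 32 parcels (8 words) is the default
                quoted in the text
    """
    hits = Counter()
    for p in p_samples:
        if start <= p < start + length:      # a "hit" inside this routine
            hits[(p - start) // bin_size] += 1
    return dict(sorted(hits.items()))


if __name__ == "__main__":
    # Made-up samples: four fall inside a 128-parcel routine, one outside.
    print(tally_bins([0, 31, 32, 100, 500], start=0, length=128))
```

Each key of the result is a bin number within the routine and each value is the hit count that a histogram bar would be drawn from.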



Example of output from TALLY. 

First, for a GRAFLIB 'typical' test problem written to identify those 

routines in which time was being spent. 
01/27/81                                    NHIT =   998

LOCATION   LENGTH     SUBROUTINE    NHIT    PERCENT
00061626   00000635   MAIN.            3     0.3006
00063560   00000015   RNFL             3     0.3006
00064764   00000515   JPPL2A           1     0.1002
00067425   00000106   ZMOVEBIT         7     0.7014
00071430   00000634   KXDRPL           2     0.2004
00072275   00000040   ZMOVEWRD         3     0.3006
00073725   00000146   KXVT2D         126    12.6253
00074073   00000230   KXCL2D         444    44.4890
00075740   00002450   KPFRLN         356    35.6713
00110614   00000070   QBPAK           51     5.1102
00113633   00000041   ZIOSTAT          2     0.2004



Second, after about one person-month of effort spent recoding the
three main time-consuming routines into CAL.



03/16/81                                    NHIT =   218

LOCATION   LENGTH     SUBROUTINE    NHIT    PERCENT
00061653   00000635   MAIN.            2     0.9174
00063605   00000015   RNFL             4     1.8349
00063622   00000121   ZCITOA           1     0.4587
00065276   00000626   JPPL2A           1     0.4587
00070173   00000106   ZMOVEBIT         4     1.8349
00074424   00000620   KXDRPL           1     0.4587
00075255   00000040   ZMOVEWRD         4     1.8349
00077130   00000040   HKXVT2D          7     3.2110
00077170   00000076   HCL2D          128    58.7156
00101335   00002410   KPFRLN           6     2.7523
00114365   00000525   KFRVEC          10     4.5872
00115112   00000070   QBPAK           49    22.4771
00115202   00000060   KWBFFN           1     0.4587

Another month spent developing and coding vector versions of HCL2D and
QBPAK reduced them to 34 and 19 hits respectively, and resulted in a final
tenfold improvement for this heavily used LLNL utility (NHIT = 97).
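
The PERCENT column and the "tenfold" figure follow directly from the hit counts. The short Python sketch below reproduces that arithmetic; the function names are illustrative, and the comparison of hit counts between runs assumes the same test problem and the same sampling interval in each run.

```python
def percent_of_total(hits, total_hits):
    """PERCENT column of a TALLY listing: a routine's share of all hits."""
    return 100.0 * hits / total_hits


def speedup(hits_before, hits_after):
    """Ratio of total hit counts between two runs of the same problem."""
    return hits_before / hits_after


if __name__ == "__main__":
    # Hit counts quoted in the listings above.
    print(f"KXCL2D, 01/27/81: {percent_of_total(444, 998):.4f}%")
    print(f"HCL2D,  03/16/81: {percent_of_total(128, 218):.4f}%")
    print(f"overall speedup 998 -> 97 hits: {speedup(998, 97):.1f}x")
```

Because hits are samples taken at a fixed interval, the ratio of total hits is also the ratio of CPU times, to within sampling error.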
FLOWTRACE

Often one would like to find out which subroutines of a large code are
frequently called and gain an overall knowledge of its flow. CFT users can
accomplish this by using FLOWTRACE. This is a compile-time option, which,
although expensive, does produce a rather nice breakdown of a code's
behavior.

An example of the output from FLOWTRACE is shown below. Full details
and assistance are available from the local CRAY representatives.



     ROUTINE       TIME        %   CALLED   AVERAGE T
  1  FENBTV    0.059817     1.18        1    0.059817   CALLS      THGEN
  2  THGEN     0.067451     1.33       23    0.002933   CALLED BY  FENBTV
  3  BCOND     0.034805     0.68        1    0.034805   CALLED BY  FENBTV
  4  ICOND     0.023386     0.46        1    0.023386   CALLED BY  FENBTV
  5  PREFRON   0.001754     0.03        1    0.001754   CALLED BY  FENBTV
  6  VSTRAP    0.087725     1.72        1    0.087725   CALLED BY  FENBTV
                                                        CALLS      OUTSOL
  7  OUTSOL    1.190628    23.41       46    0.025883   CALLED BY  VSTRAP
  8  FRONT     1.455038    28.60       22    0.066138   CALLED BY  VSTRAP
                                                        CALLS      QVSET
  9  QVSET     0.010048     0.20       94    0.000107   CALLED BY  FRONT
 10  MAKEL     0.035852     0.70        6    0.005975   CALLED BY  FRONT
                                                        CALLS      QVSET
 11  BASIS     0.000900     0.02        9    0.000100   CALLED BY  MAKEL
 12  MAKEQ     0.226627     4.45      132    0.001717   CALLED BY  FRONT
                                                        CALLS      NLMAT
 13  NLMAT     0.117950     2.32      132    0.000894   CALLED BY  MAKEQ
                                                        CALLS      ENCOM
 14  ENCOM     0.017083     0.34      132    0.000129   CALLED BY  NLMAT
 15  NLRHS     0.001008     0.02        6    0.000168   CALLED BY  MAKEQ
 16  BACSUB    0.679884    13.37       22    0.030904   CALLED BY  FRONT
 17  ITER      0.058392     1.15       21    0.002781   CALLED BY  VSTRAP
                                                        CALLS      QVSET
*** TOTAL      5.087028
*** OVERHEAD   0.033296

SUBROUTINE LINKAGE OVERHEAD SUMMARY          922 CALLS

              MINIMUM  MAXIMUM  AVERAGE   CYCLES    SECONDS        %
T REGISTERS                 22      6.2    28594   3.57e-04   0.0070
B REGISTERS         2        8      4.3    26306   3.29e-04   0.0065
ARGUMENTS                    5      0.8     2876   3.60e-05   0.0007
TOTAL                                      57776   7.22e-04   0.0142

MAXIMUM SUBROUTINE DEPTH = 7


Call SECOND(0)

Gathering timing information can be made an integral part of a routine.
A basic tool I recommend for this use within a specific FORTRAN subroutine is
the FORTLIB function SECOND. On the CRAY-1, SECOND returns the total
unweighted CPU time charged against your code since execution began. Calls
to SECOND are relatively cheap (approximately 5 microseconds per call) and
are not subject to variations due to the current time-sharing load on the
machine. Other techniques may be used for finer analysis of small code
sections, but for overall purposes SECOND is adequate. An example of its use
is shown in the code below.

      PROGRAM MF301T (UNIT59=TTY)
      COMMON D(1325)
      DIMENSION F(1024)
      CALL LINK('UNIT59=TERMINAL//')
C     TM IS THE COST OF ONE CALL TO SECOND.  TT IS THE TOTAL COST OF
C     THE FOUR TIMING CALLS MADE ON EACH OF THE 976*25 PASSES BELOW.
      E = SECOND(0)
      TM = SECOND(0)-E
      TT = TM*976.*25.*4.
      T5 = 0.
      T2 = 0.
      X = .125
      Y = .015625
      A = 15.5
      WRITE(59,58) A,X,Y
   58 FORMAT('CHECKING FOR A =',F7.4,' X =',F7.5,' Y =',F8.6)
      DO 4 K = 1,25
      B = A+X*K
      DO 1 M = 1,1325
    1 D(M) = B*B-M
      DO 3 J = 1,976
      C = Y*J
C     LOOP 5 IS TIMED INTO T5, LESS THE COST OF ONE CLOCK CALL.
      TA = SECOND(0)
      DO 5 I = 1,1024
      F(I) = (C-B*D(I))/2.
    5 CONTINUE
      TB = SECOND(0)
      T5 = T5+TB-TA-TM
C     LOOP 2 IS TIMED INTO T2 IN THE SAME WAY.
      TA = SECOND(0)
      DO 2 I = 1,1024
      IF(F(I).NE.0) GO TO 2
      E = SECOND(0)
      WRITE(59,60) B,C,D(I),E,I,J,K
   60 FORMAT('HIT AT',4F9.4,3I5)
    2 CONTINUE
      TB = SECOND(0)
      T2 = T2+TB-TA-TM
    3 CONTINUE
    4 CONTINUE
      E = SECOND(0)
      WRITE(59,59) A,E,I,J,K
      WRITE(59,61) T5,T2,TT
   61 FORMAT('LOOP5 TIME =',F9.4,3X,'LOOP2 TIME =',F9.4,3X
     %,'CLOCK CALL TIME =',F9.4)
      STOP
   59 FORMAT('          A     TIME    I    J    K',/,'NO HIT',
     % 2F9.4,3I5)
      END


Note: The source code for this example, MF301T, as well as the 

sources for all other examples in this writeup are resident on 
the CRAY-1 in public LIB file CLASS. One can extract and run 
this example using the CIVIC compiler as follows (lower case 
typing represents user input; upper case is computer output): 



lib class
C 06/13/79 09:41:03 644400
OK. x mf301t
OK. end
ALL DONE
civic mf301t mfc
*** CRAY LOADER VERSION - C120 03/08/79
ALL DONE
mfc
CHECKING FOR A = 15.5000 X = 0.12500 Y = 0.015625
HIT AT  15.7500   0.9844   0.0625   1.6406  248   63    2
HIT AT  16.2500   1.0156   0.0625   7.8015  264   65    6
HIT AT  16.5000   4.1250   0.2500  11.1936  272  264    8
HIT AT  16.7500   9.4219   0.5625  14.8103  280  603   10
HIT AT  17.2500   9.7031   0.5625  20.9958  297  621   14
HIT AT  17.5000   4.3750   0.2500  23.5364  306  280   16
HIT AT  17.7500   1.1094   0.0625  26.2874  315   71   18
HIT AT  18.2500   1.1406   0.0625  32.4474  333   73   22
HIT AT  18.5000   4.6250   0.2500  35.8796  342  296   24
          A     TIME    I    J    K
NO HIT  15.5000  38.4962 1025  977   26
LOOP5 TIME =  15.6641   LOOP2 TIME =  18.4?26   CLOCK CALL TIME =   4.2944




The following, for comparison, is the CFT version, which is automatically
vectorized for loop 5:

rcft,i=mf301t,g
CF000 - CFT VERSION - 01/23/81 1.09b
CF001 - COMPILE TIME = 0.0346 SECONDS
CF002 - 50 LINES, 44 STATEMENTS
*** CRAY LOADER VERSION - C120 03/08/79
CHECKING FOR A = 15.5000 X = 0.12500 Y = 0.015625
HIT AT  15.7500   0.9844   0.0625   0.9582  248   63    2
HIT AT  16.2500   1.0156   0.0625   4.5498  264   65    6
HIT AT  16.5000   4.1250   0.2500   6.5256  272  264    8
HIT AT  16.7500   9.4219   0.5625   8.6349  280  603   10
HIT AT  17.2500   9.7031   0.5625  12.2467  297  621   14
HIT AT  17.5000   4.3750   0.2500  13.7301  306  280   16
HIT AT  17.7500   1.1094   0.0625  15.3368  315   71   18
HIT AT  18.2500   1.1406   0.0625  18.9301  333   73   22
HIT AT  18.5000   4.6250   0.2500  20.9280  342  296   24
          A     TIME    I    J    K
NO HIT  15.5000  22.4518 1025  977   26
LOOP5 TIME =   1.1869   LOOP2 TIME =  16.9523   CLOCK CALL TIME =   4.1968




From these numbers, we can see that (for the CFT version, at least)
improvement efforts should be directed toward loop 2. (And, of course, the
calls to SECOND will eventually be removed.)
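
The bracketing pattern used in MF301T (read the clock, run the section, read the clock again, subtract the cost of one clock call) can be sketched in Python as follows. This is an illustration only: `time.process_time` stands in for SECOND, and the names `clock_overhead` and `timed` are hypothetical.

```python
import time


def clock_overhead(clock=time.process_time):
    """Cost of one clock call, measured as in MF301T: TM = SECOND(0) - E."""
    start = clock()
    return clock() - start


def timed(section, overhead, clock=time.process_time):
    """CPU time charged to section(), less one clock-call overhead.

    Mirrors T5 = T5 + TB - TA - TM in the FORTRAN example.
    """
    ta = clock()
    section()
    tb = clock()
    return tb - ta - overhead


if __name__ == "__main__":
    tm = clock_overhead()
    t_loop = timed(lambda: sum(i * i for i in range(200_000)), tm)
    print(f"clock call: {tm:.2e} s, loop: {t_loop:.4f} s")
```

On a modern machine the clock resolution may be coarser than the overhead being subtracted, so the correction matters only when the timed section is itself very short, which is the situation in the inner loops above.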
IRTC and/or Q8RTC

The CRAY-1 has a cycle counter as one of its hardware features. This is
a counter which steps by one each machine clock period of 12.5 nanoseconds.
Detailed timing of code sections can be done using this counter. However,
the counter steps whether or not your program is running, so care must be
taken with its use in the time-sharing environment. The counter, called RTC
(for real-time clock), is directly readable using FORTRAN. With CFT, one
uses the construct N = IRTC(0), and with CIVIC, N = Q8RTC(0), where N is an
integer variable name. The compiler generates only the code necessary for
reading the RTC and storing the reading in memory location N, a total of 48
bits of code, normally requiring only 3 extra clock periods to perform. (In
certain cases a longer time is required because of an S-register, path, or
memory conflict.)

The use of IRTC is illustrated in the session below. In the example, a
FORTRAN routine calls a CAL assembly routine, which adds the first 51
elements of arrays A and B and places the result into array C by use of a
scalar loop.

Here, it was possible to improve the performance of the machine on this
example by about 6% by merely reordering the modules in memory. There are
(admittedly pathological) examples of this type of thing where a change in
running time of 100% occurs. Such changes are due to the avoidance of (or
introduction of) conflicts.

First, the source codes for the example are extracted.

lib class
C 07/06/79 13:19:51 644400
OK. x abcs abcsf
OK. end
ALL DONE
abcs
19 LINES ( 80S)
.t
  1  *CAL I=ABCS,B=BABCS,L=LSC
  2            IDENT     ABCS
  3            COMMON    ABCOMMON
  4  A         BSS       57
  5  B         BSS       56
  6  C         BSS       56
  7            BLOCK     ABCS
  8            ENTRY     ABCS
  9  ABCS      A1        0
 10            A2        51
 11  LOOP      S1        A,A1
 12            S2        B,A1
 13            S3        S1+FS2
 14            C,A1      S3
 15            A1        A1+1
 16            A0        A1-A2
 17            JAN       LOOP
 18            J         B00
 19            END
.r
CAL I=ABCS,B=BABCS,L=LSC
0062K MEMORY + 0117K I/O BUFFERS USED
ALL DONE

abcsf
17 LINES ( 80S)
  1  *CFT I=ABCSF,ON=G,L=LSF,B=BSF
  2  *LDR I=(BSF,BABCS),ML=MSF,X=XBS,ORDER=CLNB,FIRST=BSF
  3  *XBS
  4        COMMON /ABCOMMON/ A(56),OUTRANGE,B(56),C(56)
  5        CALL LINK('UNIT59=TERMINAL//')
  6        Y = X*X*X*X*X*X*Y*X
  7        DO 1 I = 1,169
  8      1 A(I) = I
  9        OUTRANGE = 600004000000000000000B
 10        M = IRTC(0)
 11        CALL ABCS
 12        N = IRTC(0)
 13        X = N-M
 14        WRITE(59,59) C,X
 15     59 FORMAT(7F6.0)
 16        STOP
 17        END
.run
CFT I=ABCSF,ON=G,L=LSF,B=BSF
FT004 - CFT VERSION - 04/06/79 SCHEDULER
FT001 - COMPILE TIME = 0.0195 SECONDS
ALL DONE
LDR I=(BSF,BABCS),ML=MSF,X=XBS,ORDER=CLNB,FIRST=BSF
ALL DONE
XBS
   59.   61.   63.   65.   67.   69.   71.
   73.   75.   77.   79.   81.   83.   85.
   87.   89.   91.   93.   95.   97.   99.
  101.  103.  105.  107.  109.  111.  113.
  115.  117.  119.  121.  123.  125.  127.
  129.  131.  133.  135.  137.  139.  141.
  143.  145.  147.  149.  151.  153.  155.
  157.  159.  165.  166.  167.  168.  169.
 1773.
STOP

The last number listed (1773) is the number of machine cycles elapsing
between the two uses of IRTC in the code ABCSF.



Notice, next, the result of an apparently innocuous change to line 2.

rp2 !=BSF!=BABCS
.nf!run
17 LINES ( 80S)
CFT I=ABCSF,ON=G,L=LSF,B=BSF
FT004 - CFT VERSION - 04/06/79 SCHEDULER
FT001 - COMPILE TIME = 0.0191 SECONDS
ALL DONE
LDR I=(BSF,BABCS),ML=MSF,X=XBS,ORDER=CLNB,FIRST=BABCS
ALL DONE
XBS
   59.   61.   63.   65.   67.   69.   71.
   73.   75.   77.   79.   81.   83.   85.
   87.   89.   91.   93.   95.   97.   99.
  101.  103.  105.  107.  109.  111.  113.
  115.  117.  119.  121.  123.  125.  127.
  129.  131.  133.  135.  137.  139.  141.
  143.  145.  147.  149.  151.  153.  155.
  157.  159.  165.  166.  167.  168.  169.
 1659.
STOP
ALL DONE
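
The two cycle counts reported by the runs of ABCSF (1773 and 1659) can be converted to elapsed time and compared with a few lines of arithmetic. The Python sketch below uses the 12.5 nanosecond clock period quoted earlier; the function names are illustrative.

```python
CLOCK_PERIOD_NS = 12.5  # CRAY-1 machine clock period, as quoted in the text


def cycles_to_microseconds(cycles):
    """Elapsed time, in microseconds, for a count of machine cycles."""
    return cycles * CLOCK_PERIOD_NS / 1000.0


def percent_saved(cycles_before, cycles_after):
    """Reduction in running time, as a percentage of the original time."""
    return 100.0 * (cycles_before - cycles_after) / cycles_before


if __name__ == "__main__":
    before, after = 1773, 1659   # IRTC differences from the two runs
    print(f"FIRST=BSF:   {cycles_to_microseconds(before):.4f} microseconds")
    print(f"FIRST=BABCS: {cycles_to_microseconds(after):.4f} microseconds")
    print(f"saving: {percent_saved(before, after):.2f}%")
```

The saving works out to a little over 6 percent, which is the "about 6%" obtained by reordering the modules in memory.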
Other Methods

One can use the 072 machine instruction directly to discover ultra-fine
timing details related to hardware and special code loops. This detail is
made available to the CRAY-1 programmer through use of the public file
"CYCLES". See Section IV for more information.
III. PREDICTING TIMING

The rest of this paper will be used to demonstrate (and, I hope, teach
you) a method for explicitly predicting timing. The method can help in
avoiding unnecessary conflicts in assembly-language-coded subroutines or in
loops which one expects to utilize considerable machine time and for which,
therefore, one is justified in spending considerable human time to obtain top
performance. Since the method outlined is almost completely mechanical, a
program using these ideas has been written to generate timing charts such as
those shown below. The program is called CYCLES. Its usage is described in
Section IV of this report.

I will assume that the reader is familiar with the CRAY-1 Hardware
Manual and CAL assembly language. In particular, the five pages of our
Appendix A, taken from the CRAY-1 Hardware Manual, list much of the
information needed for timing purposes. Examples will be either given in CAL
or, on occasion, taken directly from the long listing of CFT or CIVIC.

General Remarks

In general, the time required to perform an algorithm depends on the
specific instructions used to perform it and on the relationships among those
instructions. A complete understanding of the relevant conditions affecting
the execution of a particular instruction can be gained only by considering
its relation to surrounding instructions. In particular, vector instructions
require somewhat more analysis than scalars.

I find that recording at most five easily computed numbers per
instruction will give the necessary information for determining conflicts and
suggesting ways to avoid them. For a scalar (or register) instruction one
needs to keep track of: (1) when it issues, and (2) when it completes. For
a vector instruction one has to note: (1) its issue time, (2) its chain
time, and the (different) times when it has finished using: (3) its input
registers, (4) its functional unit, and (5) its output register.
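
The bookkeeping described above amounts to one small record per instruction. The Python sketch below shows one possible layout; the class and field names are hypothetical, and the sample values are made up for illustration and are not derived from the CRAY-1 timing rules.

```python
from dataclasses import dataclass, fields


@dataclass
class ScalarTiming:
    issue: int      # (1) cycle at which the instruction issues
    complete: int   # (2) cycle at which it completes


@dataclass
class VectorTiming:
    issue: int          # (1) issue time
    chain: int          # (2) chain time
    inputs_free: int    # (3) finished using its input registers
    unit_free: int      # (4) finished using its functional unit
    output_free: int    # (5) finished using its output register


def numbers_recorded(record):
    """How many timing numbers a record carries (2 scalar, 5 vector)."""
    return len(fields(record))


if __name__ == "__main__":
    # Made-up cycle numbers, for illustration only.
    scalar = ScalarTiming(issue=0, complete=2)
    vector = VectorTiming(issue=3, chain=12, inputs_free=67,
                          unit_free=70, output_free=76)
    print(numbers_recorded(scalar), numbers_recorded(vector))
```

Keeping the scalar and vector records separate makes it plain that at most five numbers are ever needed per instruction.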

In all cases, except for scalar memory-referencing instructions (and
normally it is true then, also), once the issue cycle has been determined,
all the other timing numbers for that instruction are computable. The rules
for doing these computations are stated on page 25 of this report, and the
exceptions are noted in appropriate examples.

Table 1 (adapted from Appendix D of the CRAY-1 Hardware Manual) lists
the entire set of timing numbers (first column) needed for most purposes.
These specify the number of 12.5 nanosecond machine cycles required by the
CRAY-1 to deliver a result to the appropriate register. (0 means no result
goes to a register.) Further detail is available in Chapter 4 of the CRAY-1
Hardware Manual in conjunction with each specific instruction description.

Note. All instructions using the Memory Functional Unit are subject to
possible additional delays due to memory bank conflicts with I/O.
Table 1. Instruction and Timing Summary

Cycles  CRAY-1     CAL mnemonics    Unit  Description
50      000xxx     ERR              -     Error exit
50      **000ijk   ERR     exp      -     Error exit
        *001000    NOP              -     No operation
1       **0010jk   CA,Aj   Ak       -     Set the channel (Aj) current address to (Ak) and begin the I/O sequence
        **0011jk   CL,Aj   Ak       -     Set the channel (Aj) limit address to (Ak)
        **0012jx   CI,Aj            -     Clear channel (Aj) interrupt flag
        **0013jx   XA      Aj       -     Enter XA register with (Aj)
        **0014jx   RT      Sj       -     Enter real-time clock register with (Sj)
        **0014j4   PCI     Sj       -     Enter II with (Sj)
        **0014j5   CCI              -     Clear clock interrupt
        **0014j6   ECI              -     Enable clock interrupt
        **0014j7   DCI              -     Disable clock interrupt
        0020xk     VL      Ak       -     Transmit (Ak) to VL register
        *0020x0    VL      1        -     Transmit 1 to VL register
        0021xx     EFI              -     Enable interrupt on flt pt error
1       0022xx     DFI              -     Disable interrupt on flt pt error
3       003xjx     VM      Sj       -     Transmit (Sj) to VM register
3       *003x0x    VM      0        -     Clear VM register
50      004xxx     EX               -     Normal exit
50      **004ijk   EX               -     Normal exit
7(+)    005xjk     J       Bjk      -     Jump to (Bjk)
5(+)    006ijkm    J       exp      -     Jump to exp
5(+)    007ijkm    R       exp      -     Return jump to exp; set B00 to P
5(+)    010ijkm    JAZ     exp      -     Branch to exp if (A0) = 0
5(+)    011ijkm    JAN     exp      -     Branch to exp if (A0).NE.0
5(+)    012ijkm    JAP     exp      -     Branch to exp if (A0) positive
5(+)    013ijkm    JAM     exp      -     Branch to exp if (A0) negative
5(+)    014ijkm    JSZ     exp      -     Branch to exp if (S0) = 0
5(+)    015ijkm    JSN     exp      -     Branch to exp if (S0).NE.0
5(+)    016ijkm    JSP     exp      -     Branch to exp if (S0) positive
5(+)    017ijkm    JSM     exp      -     Branch to exp if (S0) negative
1       020ijkm    Ai      exp      -     Transmit exp = jkm to Ai
1       021ijkm    Ai      exp      -     Transmit exp = 1's complement of jkm to Ai
1       022ijk     Ai      exp      -     Transmit exp = jk to Ai
1       023ijx     Ai      Sj       -     Transmit (Sj) to Ai
1       024ijk     Ai      Bjk      -     Transmit (Bjk) to Ai

*   Special CAL syntax form.
**  Privileged to monitor mode.
x   Indicates that the field is not used by the hardware; the assembler
    generates a zero in this position.
+   These jump instructions take longer if branched-to address is not already
    in an instruction buffer. They then use the memory functional unit.
Cycles  CRAY-1     CAL mnemonics    Unit        Description
1       025ijk     Bjk     Ai                   Transmit (Ai) to Bjk
4       026ij0     Ai      PSj      Pop/LZ      Population count of (Sj) to Ai
4       026ij1     Ai      QSj      Pop/LZ      Pop count parity of (Sj) to Ai
3       027ijx     Ai      ZSj      Pop/LZ      Leading zero count of (Sj) to Ai
2       030ijk     Ai      Aj+Ak    A Int Add   Integer sum of (Aj) and (Ak) to Ai
2       *030i0k    Ai      Ak       A Int Add   Transmit (Ak) to Ai
2       *030ij0    Ai      Aj+1     A Int Add   Integer sum of (Aj) and 1 to Ai
2       031ijk     Ai      Aj-Ak    A Int Add   Integer difference of (Aj) less (Ak) to Ai
2       *031i00    Ai      -1       A Int Add   Transmit -1 to Ai
2       *031i0k    Ai      -Ak      A Int Add   Transmit the negative of (Ak) to Ai
2       *031ij0    Ai      Aj-1     A Int Add   Integer difference of (Aj) less 1 to Ai
6       032ijk     Ai      Aj*Ak    A Int Mult  Integer product of (Aj) and (Ak) to Ai
4       *033i0x    Ai      CI                   Channel number to Ai (j=0)
4       *033ij0    Ai      CA,Aj                Address of channel (Aj) to Ai (j.NE.0)
4       033ij1     Ai      CE,Aj                Error flag of channel (Aj) to Ai (j.NE.0)
14(+)   034ijk     Bjk,Ai  ,A0      Memory      Read (Ai) words to B register jk from (A0)
14(+)   *034ijk    Bjk,Ai  0,A0     Memory      Read (Ai) words to B register jk from (A0)
6(+)    035ijk     ,A0     Bjk,Ai   Memory      Store (Ai) words at B register jk to (A0)
6(+)    *035ijk    0,A0    Bjk,Ai   Memory      Store (Ai) words at B register jk to (A0)
14(+)   036ijk     Tjk,Ai  ,A0      Memory      Read (Ai) words to T register jk from (A0)
14(+)   *036ijk    Tjk,Ai  0,A0     Memory      Read (Ai) words to T register jk from (A0)
6(+)    037ijk     ,A0     Tjk,Ai   Memory      Store (Ai) words at T register jk to (A0)
6(+)    *037ijk    0,A0    Tjk,Ai   Memory      Store (Ai) words at T register jk to (A0)
1       040ijkm    Si      exp                  Transmit jkm to Si
1       041ijkm    Si      exp                  Transmit exp = 1's complement of jkm to Si
1       042ijk     Si      <exp     S Logical   Form 1's mask exp = 64-jk bits in Si from the right
1       *042ijk    Si      #>exp    S Logical   Form 0's mask exp = jk bits in Si from the left
1       *042i00    Si      -1       S Logical   Enter -1 into Si

*   Special CAL syntax form.
+   The cycles needed = this number + (Ai). Also, no issues allowed
    till completion.
x   Field not used.
Cycles  CRAY-1     CAL mnemonics      Unit        Description
1       *042i77    Si      1          S Logical   Enter 1 into Si
1       043ijk     Si      >exp       S Logical   Form 1's mask exp = jk bits in Si from the left
1       *043ijk    Si      #<exp      S Logical   Form 0's mask exp = 64-jk bits in Si from the right
1       *043i00    Si      0          S Logical   Clear Si
1       044ijk     Si      Sj&Sk      S Logical   Logical product of (Sj) and (Sk) to Si
1       *044ij0    Si      Sj&SB      S Logical   Sign bit of (Sj) to Si
1       045ijk     Si      #Sk&Sj     S Logical   Logical product of (Sj) and 1's complement of (Sk) to Si
1       *045ij0    Si      #SB&Sj     S Logical   (Sj) with sign bit cleared to Si
1       046ijk     Si      Sj\Sk      S Logical   Logical difference of (Sj) and (Sk) to Si
1       *046ij0    Si      Sj\SB      S Logical   Toggle sign bit of Sj, then enter into Si
1       *046ij0    Si      SB\Sj      S Logical   Toggle sign bit of Sj, then enter into Si (j.NE.0)
1       047ijk     Si      #Sj\Sk     S Logical   Logical equivalence of (Sk) and (Sj) to Si
1       *047i0k    Si      #Sk        S Logical   Transmit 1's complement of (Sk) to Si
1       *047ij0    Si      #Sj\SB     S Logical   Logical equivalence of (Sj) and sign bit to Si
1       *047i00    Si      #SB        S Logical   Enter 1's complement of sign bit into Si
1       050ijk     Si      Sj!Si&Sk   S Logical   Logical product of (Si) and (Sk) complement ORed with logical product of (Sj) and (Sk) to Si
1       *050ij0    Si      Sj!Si&SB   S Logical   Scalar merge of (Si) and sign bit of (Sj) to Si
1       051ijk     Si      Sj!Sk      S Logical   Logical sum of (Sj) and (Sk) to Si
1       *051i0k    Si      Sk         S Logical   Transmit (Sk) to Si
1       *051ij0    Si      Sj!SB      S Logical   Logical sum of (Sj) and sign bit to Si
1       *051i00    Si      SB         S Logical   Enter sign bit into Si
2       052ijk     S0      Si<exp     S Shift     Shift (Si) left exp = jk places to S0


2 


053 i jk 


SO 


S i >exp 


s 


Shift 


Shift (Si) right exp = 64-jk places 
to SO 


2 


054 i jk 


Si 


S i <exp 


s 


Shift 


Shift (Si) left exp = jk places 


2 


055 i jk 


Si 


S i >exp 


s 


Shift 


Shift (Si) right exp = 64-jk places 


3 


056 i jk 


Si 


Si , Sj<Ak 


s 


Shift 


Shift (Si and Sj ) left (Ak) places 
to Si 


3 


*056 i j 


Si 


Si ,Sj <1 


s 


Shift 


Shift (Si and Sj ) left one place 
to Si 



1 
11 
II 
II 
II 
II 
1 



il 
II 

11 



Specie! CAL syntax form. 
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Cy- 
cles  CRAY-1    CAL mnemonics      Unit         Description

  3   *056i0k   Si    Si<Ak        S Shift      Shift (Si) left (Ak) places to Si
  3   057ijk    Si    Sj,Si>Ak     S Shift      Shift (Sj and Si) right (Ak) places to Si
  3   *057ij0   Si    Sj,Si>1      S Shift      Shift (Sj and Si) right one place to Si
  3   *057i0k   Si    Si>Ak        S Shift      Shift (Si) right (Ak) places to Si
  3   060ijk    Si    Sj+Sk        S Int Add    Integer sum of (Sj) and (Sk) to Si
  3   061ijk    Si    Sj-Sk        S Int Add    Integer difference of (Sj) and (Sk) to Si
  3   *061i0k   Si    -Sk          S Int Add    Transmit negative of (Sk) to Si
  6   062ijk    Si    Sj+FSk       F.P. Add     Floating sum of (Sj) and (Sk) to Si
  6   *062i0k   Si    +FSk         F.P. Add     Normalize (Sk) to Si
  6   063ijk    Si    Sj-FSk       F.P. Add     Floating difference of (Sj) and (Sk) to Si
  6   *063i0k   Si    -FSk         F.P. Add     Transmit normalized negative of (Sk) to Si
  7   064ijk    Si    Sj*FSk       F.P. Mult    Floating product of (Sj) and (Sk) to Si
  7   065ijk    Si    Sj*HSk       F.P. Mult    Half precision rounded floating product of (Sj) and (Sk) to Si
  7   066ijk    Si    Sj*RSk       F.P. Mult    Full precision rounded floating product of (Sj) and (Sk) to Si
  7   067ijk    Si    Sj*ISk       F.P. Mult    2 - Floating product of (Sj) and (Sk) to Si
 14   070ijx    Si    /HSj         F.P. Recip.  Floating reciprocal approximation of (Sj) to Si
  2   071i0k    Si    Ak           -            Transmit (Ak) to Si with no sign extension
  2   071i1k    Si    +Ak          -            Transmit (Ak) to Si with sign extension
  2   071i2k    Si    +FAk         -            Transmit (Ak) to Si as unnormalized floating point number
  2   071i3x    Si    0.6          -            Transmit constant 0.75*2**48 to Si
  2   071i4x    Si    0.4          -            Transmit constant 0.5 to Si
  2   071i5x    Si    1.           -            Transmit constant 1.0 to Si
  2   071i6x    Si    2.           -            Transmit constant 2.0 to Si
  2   071i7x    Si    4.           -            Transmit constant 4.0 to Si
  1   072ixx    Si    RT           -            Transmit (RTC) to Si
  1   073ixx    Si    VM           -            Transmit (VM) to Si
  1   074ijk    Si    Tjk          -            Transmit (Tjk) to Si
  1   075ijk    Tjk   Si           -            Transmit (Si) to Tjk

* Special CAL syntax form.
x Field not used.
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Cy- 


cles  CRAY-1     CAL mnemonics       Unit        Description

  5   076ijk     Si      Vj,Ak       -           Transmit (Vj, element (Ak)) to Si
      077ijk     Vi,Ak   Sj          -           Transmit (Sj) to Vi element (Ak)
      *077i0k    Vi,Ak   0           -           Clear Vi element (Ak)
 11   10hijkm    Ai      exp,Ah      Memory      Read from ((Ah) + exp) to Ai (A0=0)
 11   *100ijkm   Ai      exp,0       Memory      Read from (exp) to Ai
 11   *100ijkm   Ai      exp,        Memory      Read from (exp) to Ai
 11   *10hi000   Ai      ,Ah         Memory      Read from (Ah) to Ai
      11hijkm    exp,Ah  Ai          Memory      Store (Ai) to (Ah) + exp (A0=0)
      *110ijkm   exp,0   Ai          Memory      Store (Ai) to exp
      *110ijkm   exp,    Ai          Memory      Store (Ai) to exp
      *11hi000   ,Ah     Ai          Memory      Store (Ai) to (Ah)
 11   12hijkm    Si      exp,Ah      Memory      Read from ((Ah) + exp) to Si (A0=0)
 11   *120ijkm   Si      exp,0       Memory      Read from exp to Si
 11   *120ijkm   Si      exp,        Memory      Read from exp to Si
 11   *12hi000   Si      ,Ah         Memory      Read from (Ah) to Si
      13hijkm    exp,Ah  Si          Memory      Store (Si) to (Ah) + exp (A0=0)
      *130ijkm   exp,0   Si          Memory      Store (Si) to exp
      *130ijkm   exp,    Si          Memory      Store (Si) to exp
      *13hi000   ,Ah     Si          Memory      Store (Si) to (Ah)
  4   140ijk     Vi      Sj&Vk       V Logical   Logical products of (Sj) and (Vk) to Vi
  4   141ijk     Vi      Vj&Vk       V Logical   Logical products of (Vj) and (Vk) to Vi
  4   142ijk     Vi      Sj!Vk       V Logical   Logical sums of (Sj) and (Vk) to Vi
  4   *142i0k    Vi      Vk          V Logical   Transmit (Vk) to Vi
  4   143ijk     Vi      Vj!Vk       V Logical   Logical sums of (Vj) and (Vk) to Vi
  4   144ijk     Vi      Sj\Vk       V Logical   Logical differences of (Sj) and (Vk) to Vi
  4   *145iii    Vi      0           V Logical   Clear Vi
  4   145ijk     Vi      Vj\Vk       V Logical   Logical differences of (Vj) and (Vk) to Vi
  4   146ijk     Vi      Sj!Vk&VM    V Logical   Transmit (Sj) if VM bit = 1; (Vk) if VM bit = 0 to Vi
  4   *146i0k    Vi      #VM&Vk      V Logical   Vector merge of (Vk) and 0 to Vi
  4   147ijk     Vi      Vj!Vk&VM    V Logical   Transmit (Vj) if VM bit = 1; (Vk) if VM bit = 0 to Vi
  6   150ijk     Vi      Vj<Ak       V Shift     Shift (Vj) left (Ak) places to Vi
  6   *150ij0    Vi      Vj<1        V Shift     Shift (Vj) left one place to Vi
  6   151ijk     Vi      Vj>Ak       V Shift     Shift (Vj) right (Ak) places to Vi
  6   *151ij0    Vi      Vj>1        V Shift     Shift (Vj) right one place to Vi
  6   152ijk     Vi      Vj,Vj<Ak    V Shift     Double shift (Vj) left (Ak) places to Vi

* Special CAL syntax form.
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Cy- 
cles  CRAY-1     CAL mnemonics       Unit          Description

  6   *152ij0    Vi      Vj,Vj<1     V Shift       Double shift (Vj) left one place to Vi
  6   153ijk     Vi      Vj,Vj>Ak    V Shift       Double shift (Vj) right (Ak) places to Vi
  6   *153ij0    Vi      Vj,Vj>1     V Shift       Double shift (Vj) right one place to Vi
  5   154ijk     Vi      Sj+Vk       V Int Add     Integer sums of (Sj) and (Vk) to Vi
  5   155ijk     Vi      Vj+Vk       V Int Add     Integer sums of (Vj) and (Vk) to Vi
  5   156ijk     Vi      Sj-Vk       V Int Add     Integer differences of (Sj) and (Vk) to Vi
  5   *156i0k    Vi      -Vk         V Int Add     Transmit negative of (Vk) to Vi
  5   157ijk     Vi      Vj-Vk       V Int Add     Integer differences of (Vj) and (Vk) to Vi
  9   160ijk     Vi      Sj*FVk      F.P. Mult     Floating products of (Sj) and (Vk) to Vi
  9   161ijk     Vi      Vj*FVk      F.P. Mult     Floating products of (Vj) and (Vk) to Vi
  9   162ijk     Vi      Sj*HVk      F.P. Mult     Half precision rounded floating products of (Sj) and (Vk) to Vi
  9   163ijk     Vi      Vj*HVk      F.P. Mult     Half precision rounded floating products of (Vj) and (Vk) to Vi
  9   164ijk     Vi      Sj*RVk      F.P. Mult     Rounded floating products of (Sj) and (Vk) to Vi
  9   165ijk     Vi      Vj*RVk      F.P. Mult     Rounded floating products of (Vj) and (Vk) to Vi
  9   166ijk     Vi      Sj*IVk      F.P. Mult     2 - floating products of (Sj) and (Vk) to Vi
  9   167ijk     Vi      Vj*IVk      F.P. Mult     2 - floating products of (Vj) and (Vk) to Vi
  8   170ijk     Vi      Sj+FVk      F.P. Add      Floating sums of (Sj) and (Vk) to Vi
  8   *170i0k    Vi      +FVk        F.P. Add      Normalize (Vk) to Vi
  8   171ijk     Vi      Vj+FVk      F.P. Add      Floating sums of (Vj) and (Vk) to Vi
  8   172ijk     Vi      Sj-FVk      F.P. Add      Floating differences of (Sj) and (Vk) to Vi
  8   *172i0k    Vi      -FVk        F.P. Add      Transmit normalized negatives of (Vk) to Vi
  8   173ijk     Vi      Vj-FVk      F.P. Add      Floating differences of (Vj) and (Vk) to Vi
 16   174ij0     Vi      /HVj        F.P. Recip.   Floating reciprocal approximations of (Vj) to Vi
  8   174ij1     Vi      PVj         F.P. Recip.   Population counts of (Vj) to Vi
  8   174ij2     Vi      QVj         F.P. Recip.   Pop count parity of (Vj) to Vi

* Special CAL syntax form.
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Cy- 
cles  CRAY-1     CAL mnemonics       Unit        Description

  6   175xj0     VM      Vj,Z        V Logical   VM=1 where (Vj) = 0
  6   175xj1     VM      Vj,N        V Logical   VM=1 where (Vj).NE.0
  6   175xj2     VM      Vj,P        V Logical   VM=1 where (Vj) positive
  6   175xj3     VM      Vj,M        V Logical   VM=1 where (Vj) negative
  9   176ixk     Vi      ,A0,Ak      Memory      Read (VL) words to Vi from (A0) incremented by (Ak)
  9   *176ix0    Vi      ,A0,1       Memory      Read (VL) words to Vi from (A0) incremented by 1
      177xjk     ,A0,Ak  Vj          Memory      Store (VL) words from Vj to (A0) incremented by (Ak)
      *177xj0    ,A0,1   Vj          Memory      Store (VL) words from Vj to (A0) incremented by 1

* Special CAL syntax form.
x Field not used.
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The Basic Details 

In general we have the following scenario: in order to perform some
alteration of the contents of one or more of the machine's registers or
memory, an instruction must: first, wait to be brought into one of the
instruction buffers; second, wait until prior instructions have started;
third, wait till its operands are available; and fourth, wait until all
shared components (such as paths along which information may flow, registers
that may be needed, and functional units that may be employed) will be
available during the cycle(s) required. The CRAY-1 hardware maintains
reservation tables, updated each cycle, for each register and all other
shared components. It releases or issues an instruction only when it can be
completed without interference from other previously issued instructions.

Generally, timing analysis begins when the first instruction of interest
issues, but it is naive not to consider its placement in an instruction
buffer and the route by which it reached issuable condition. For many
algorithms, speed changes on the order of 10% occur depending on their
placement relative to the start of an instruction buffer. Details about the
instruction fetch mechanism are found in Appendix C.

All of the information used to decide about the issue of an instruction
is contained in its 16 bits or, in the case of a 32-bit instruction, in its
upper 16 bits. Normally the decision to issue can be made in one cycle.
When an instruction issues, the components it will use are reserved in the
appropriate table for the appropriate time period.

One type of 32-bit instruction, which makes a scalar memory reference,
is allowed to issue when all of the components it will need are available
except possibly the appropriate memory bank. If the bank is available at the
proper time, all proceeds normally. If not, completion of the instruction is
delayed and the next instruction requesting memory is not allowed to issue
until the previous one has obtained the proper memory access. Instructions
not requiring memory, however, may proceed normally.

Until a specific instruction issues, the machine cannot look beyond it
to determine that something further down in the instruction sequence could be
done. It is the task of the programmer and compiler to order the
computation so that unnecessary delays are avoided. When you program in
assembly language, it is important (and not difficult) to maintain an
understanding of the resources of the machine called into play by each
instruction and of the cycles in which they are used, in order to approach
optimum utilization of the hardware.

During the issue cycle, paths are opened so that information can flow
from registers to functional units; during the completion cycle, paths are
required for information to flow from functional units to registers. Only
one path is available to service all results being returned to any of the
eight S-registers. There is also one path for the A-registers. Possible
conflicts over the use of these paths are resolved before an instruction is
allowed to issue. A separate path into and out of each vector register is
provided. Moreover, information arriving at any register in a given cycle
may also be redirected by a subsequent instruction, in that same cycle, to
serve as input for another operation. That is, a subsequent instruction may
issue on the same cycle in which its operands first become available. This
redirection of information arriving at a vector register is called chaining,
and it may begin only during the particular cycle when the first element of
the result is returned from a functional unit. If two different functional
units return their first results in the same cycle, a third instruction may
chain from both of them.

An exception to this "same cycle rule" occurs for conditional branch
instructions, which require that their operand register become available
somewhat before issue.

Two Short Examples 

Let us consider what the hardware must take into account to decide when 
to issue a couple of typical instructions. 

First, a scalar floating point add: 62312, S3 S1+FS2. 

When the instruction sequence reaches such an instruction, the hardware
checks its reservation tables to see that none of the following conditions
are true: (1) the floating point add functional unit is busy (i.e.,
reserved) in this cycle, (2) register S3 is busy, (3) register S1 is busy,
(4) register S2 is busy, (5) a reservation exists for the S-register input
path 6 cycles hence. If any of these conditions are true, the instruction
does not issue. In the next cycle (the machine having updated all its
tables), the same conditions are tested. Eventually, all the needed
components will be free and the instruction will issue. When it does, the
tables will have: (1) a busy condition placed on S3 for 6 cycles (i.e.,
cycles 0, 1, 2, 3, 4, and 5) and (2) a reservation placed on the S-register
input path 6 cycles hence (cycle 6). (No reservation is put on a functional
unit by a scalar instruction.) In the next cycle, the next instruction will
be considered for issue, and the components it needs will be checked for
availability.
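The issue test just described can be sketched as a small model. This is an illustration only, not the hardware's actual logic: the class name ScalarIssueModel, its fields, and the follow-on instruction (a hypothetical 5-cycle instruction that writes S5) are assumptions made for the example, and the functional-unit reservations that vector instructions leave behind are omitted.

```python
class ScalarIssueModel:
    """Toy reservation tables for S-register results (illustrative only)."""

    def __init__(self):
        self.reg_free_at = {}        # register -> first cycle it is free again
        self.path_reserved = set()   # cycles on which the S-register path is taken

    def can_issue(self, cycle, dest, sources, latency):
        # Conditions (2)-(4): result and operand registers must not be busy.
        for reg in (dest, *sources):
            if self.reg_free_at.get(reg, 0) > cycle:
                return False
        # Condition (5): the result path must be free `latency` cycles hence.
        return (cycle + latency) not in self.path_reserved

    def issue(self, earliest, dest, sources, latency):
        """Retry each cycle until issue is possible; return (issue, completion)."""
        cycle = earliest
        while not self.can_issue(cycle, dest, sources, latency):
            cycle += 1
        self.reg_free_at[dest] = cycle + latency   # busy for cycles I .. I+latency-1
        self.path_reserved.add(cycle + latency)    # path reservation at completion
        return cycle, cycle + latency


model = ScalarIssueModel()
first = model.issue(0, "S3", ("S1", "S2"), 6)    # the floating add 62312
second = model.issue(1, "S5", ("V6", "A7"), 5)   # hypothetical 5-cycle follow-on
```

In this sketch the second instruction cannot issue at cycle 1, because its result would need the S-register path at cycle 6, which the add has already reserved; it issues one cycle later instead.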

Now consider a vector instruction: 171312, V3 V1+FV2. 

When this floating point vector add is reached, the hardware checks its
reservation tables for the following conditions: (1) floating point adder
reserved, (2) vector register V3 busy, (3) V1 busy, and (4) V2 busy. It does
not need to check for path reservations since each V-register has its own
path. When none of these conditions are true, the instruction issues. When
it does, (1) the tables have a busy condition placed on V1 and V2 for
max((VL),5) cycles, where (VL) is the current value of the vector length
register (thus for short vectors a minimum reservation of 5 cycles occurs),
(2) a busy is placed on the floating point adder for (VL)+4 cycles, (3) a
busy is placed on V3 for cycles 1 through 7 and cycles 9 through
7+max((VL),5). Cycle 8 is the "chain" cycle.

The Timing Chart 

We can keep track of important cycles by listing them in a timing chart.
Then, when we want to consider whether a particular instruction can issue, we
have the information at hand. In practice, it is easier to list the cycles
when a component will next become ready for use than to record those in which
it is busy.

In such a chart, I and C refer to issue cycle and completion cycle for
scalars, respectively, while I, C, O, F, and R refer to issue cycle, chain
cycle, operand register ready cycle, functional unit available cycle, and
result register ready cycle for vectors.

Thus we have:                      I    C    O    F    R

        62312    S3  S1+FS2        0    6

while, supposing the following instruction comes in sequence with the above
and that (VL) = 64:

        171312   V3  V1+FV2        1    9   65   69   73.

The numbers recorded in the various columns represent the cycles in
which certain important changes will occur as a result of the issue of the
instruction in question. (Since for scalar instructions, the last three
columns are not particularly informative, one may omit them.) Different
types of instructions tie up different machine resources for differing
numbers of cycles, as indicated in Table 1. (See also Appendices A and D of
the CRAY-1 Hardware Manual.) In the examples that follow, we will
demonstrate the practical uses of these timing numbers. In general, the entry
in the C column is the I number plus the appropriate instruction
execution-complete time from the first column of Table 1.
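As a minimal sketch of that rule (C = I + execution-complete time), the lookup below uses a handful of cycle counts copied from Table 1 rows of this document. The names EXEC_TIME and completion_cycle are hypothetical, and only a few opcodes are included.

```python
# Execution-complete times (cycles) for a few scalar opcodes, keyed by the
# leading three octal digits of the machine code. Values are taken from
# Table 1; the dictionary itself is only an illustrative subset.
EXEC_TIME = {
    "062": 6,    # Si  Sj+FSk   floating add
    "067": 7,    # Si  Sj*ISk   floating multiply (2 - product form)
    "070": 14,   # Si  /HSj     reciprocal approximation
    "072": 1,    # Si  RT       read real-time clock
    "076": 5,    # Si  Vj,Ak    vector element to Si
}


def completion_cycle(machine_code: str, issue: int) -> int:
    """Return the C entry for an instruction issuing on cycle `issue`."""
    return issue + EXEC_TIME[machine_code[:3]]


c_add = completion_cycle("062312", 0)      # the scalar add used in the text
c_recip = completion_cycle("070610", 0)    # a reciprocal issuing at cycle 0
```

This covers only the arithmetic of the C column; whether the instruction can actually issue on the chosen cycle is a separate question, taken up in the sections that follow.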

Preliminary Considerations 

Consider the first add mentioned above: 62312, with I = 0 and C = 6.
The 6 has two meanings. First, it is the cycle on which the result will be
returned to S3 via the S-register output path. This means that this number
cannot appear as the C cycle for any other (later issued) instruction whose
result is destined for any S-register. For example, if the next instruction
were 76567, transmit a V-register element to S5, which takes 5 cycles, then
the machine must delay its issue. If you are recording the I and C numbers
for a series of instructions, you should notice when you record two identical
numbers in the C column. If the second is a result for the same set of
registers as the first, it will be delayed, and you must adjust the issue
cycle accordingly.






                 I     C

        70610    0    14
        72600   14    15



since the result of the clock read is not allowed to use S6 until the 
reciprocal is through with it. This assures that the result of the 
reciprocal will be overwritten by the later instruction. 

It is perhaps more common that a later instruction which would use the 
result of the reciprocal as an operand, would have to wait for it. Thus: 





                 I     C

        70610    0    14
        67561   14    21



would be the timing for these two instructions. 



For vector instructions, the relations among the numbers I, C, O, F, and R
are found as follows: When the issue time I becomes known, then C will be
equal to I + the chain time for this instruction (the chain time being the
functional unit time + 2), O will equal I + (VL), F = I + 4 + (VL) (thus F
will normally be O + 4) (here, however, one exception exists, for vector
store F = I + 5 + (VL)), and finally R = C + (VL). For short vectors, where
(VL) ≤ 4, C and F are as before, while O = I + 5 and R = C + 5.
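These relations can be written as a short function. This is a sketch under the rules stated in the paragraph above; the function name vector_timing and its argument names are assumptions, and the chain time must be supplied by the caller from Table 1 (or passed as None for a vector store, which has no chain or result cycle).

```python
from typing import Optional


def vector_timing(issue: int, chain_time: Optional[int], vl: int,
                  is_store: bool = False) -> dict:
    """Compute the I, C, O, F, R chart entries for one vector instruction."""
    span = max(vl, 5)                       # short vectors still hold registers 5 cycles
    chain = None if chain_time is None else issue + chain_time
    return {
        "I": issue,
        "C": chain,                                  # C = I + chain time
        "O": issue + span,                           # O = I + (VL), or I + 5 if short
        "F": issue + (5 if is_store else 4) + vl,    # store uses I + 5 + (VL)
        "R": None if chain is None else chain + span,  # R = C + (VL), or C + 5
    }


# The floating vector add 171312 issuing at cycle 1 with (VL) = 64;
# its chain time of 8 is the Table 1 entry for 171ijk.
chart = vector_timing(issue=1, chain_time=8, vl=64)
```

With these inputs the function gives C = 9, O = 65, F = 69, and R = 73, the same row recorded in the timing chart for 171312.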



Thus if (VL) = ... we have:

                        I    C    O    F    R

        171312          1    9    6   ...   14



All five vector timing numbers depend only on the chain (C) cycle (from
Table 1), (VL), and issue (I).
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Two Basic Examples and Comments 

In the two examples below, taken from (more or less) real programs,
nearly all of the main ideas surrounding accurate timing of code are
mentioned. Examine the instruction sequence and refer to the notes for an
explanation of the timing numbers listed.

Example 1 

First, we consider the earlier example, ABC: 

* CFT I=ABCSF,ON=G,L=LSF,B=BSF
* LDR I=(BSF,BABCS),ML=MSF,X=XBS,ORDER=CLNB,FIRST=BABCS
* XBS
      COMMON /ABCOMMON/ A(56), OUTRANGE, B(56), C(56)
      CALL LINK('UNIT59=TERMINAL//')
      Y = X*X*X*X*X*X*Y*X
      DO 1 I = 1,169
    1 A(I) = I
      OUTRANGE = 600004000000000000000B
      M = IRTC(0)
      CALL ABCS
      N = IRTC(0)
      X = N-M
      WRITE (59,59) C,X
   59 FORMAT (7F6.0)
      STOP
      END

ABC consists of a FORTRAN part, ABCSF (MAIN.), where the RTC is read,
and a CAL part ABCS, where adds are done. We note that we are timing the
case where the assembly portion is loaded first.

Listed below is the set of six assembly instructions generated by CFT
for the portion of the code where the RTC read occurs (extracted from the
long listing). The address listed is after the load. Recall that I and C
refer to the machine cycle on which instruction issue and completion,
respectively, occur (see Table 1). (The small letters refer to notes
following.)

Address  Machine code      Mnemonics        I       C       Comment
         (octal)           (decimal)

251a     072300            S3    RT         0e      1f      Read RTC
251b     130300 000225     M,    S3         1g      -h      Save RTC
251d     022700            A7    0          3i      4j      Arg count
252a     007000 001000     R     ABCS       4       19k     Call subroutine
252c     120100 000225     S1    M,         1657m   1668n   Get saved RTC
253a     072700            S7    RT         1659    1660    Read new RTC



Notes: (a,b,c,d at the left refer to the parcel address in the word where
the instruction is located.)

e.  Assume all resources of the machine are available, initially.

f.  A "72" instruction requires one cycle to complete after issue (see
    Table 1). If any previously issued instruction had needed to put a
    result into any S-register during cycle 1, the issue of this
    instruction would have to be delayed by the machine.

g.  The instruction following a 16-bit instruction may issue on the next
    cycle (if there is no conflict, as is the case here), S3 being now
    available.

h.  A store instruction uses an S or A register only during the issue
    cycle. The result actually reaches memory several cycles later, but
    for purposes of subsequent fetch instructions, vector loads, or
    memory busy conditions, the memory is essentially free after four
    cycles, while the register itself remains free.

i.  The instruction following a 32-bit instruction may not issue until
    after a delay of one cycle (to bypass the lower 16 bits).

j.  A "22" instruction requires one cycle to complete after issue. If a
    previously issued instruction needed to put a result into any
    A-register during cycle 4, this issue would be delayed. (But an
    S-register result could complete then without delaying this.)

k.  This instruction, which would normally complete at cycle 18, is
    delayed for one cycle by memory busy from the previous store, since a
    memory-busy condition is not allowed when starting the fetch of the
    next 16-word buffer-load of instructions. If this "007" instruction
    addressed an instruction from code already in a buffer, it would
    complete at cycle 9. In the case of a jump instruction, completion
    means that the jumped-to instruction may issue.

m.  This fetch instruction cannot issue until the called subroutine
    returns to it. See the analysis of ABCS below.

n.  When it does issue it will require 11 cycles for the contents of
    memory to reach the S register. The memory bank will be free after
    only four cycles.

Now consider the CAL portion of our example, called by the FORTRAN
portion above.

       CAL I=ABCS,B=BABCS,L=LSC

                                     IDENT     ABCS
                                     COMMON    ABCOMMON
  71                        A        BSS       57
  70                        B        BSS       56
  70                        C        BSS       56
                                     BLOCK     ABCS
                                     ENTRY     ABCS
       022100               ABCS     A1        0
       022263                        A2        51
       1211 00000000C       LOOP     S1        A,A1
       1212 00000071C                S2        B,A1
       062312                        S3        S1+FS2
       1313 00000161C                C,A1      S3
       030110                        A1        A1+1
       031012                        A0        A1-A2
       011 00000^00c+                JAN       LOOP
       005000                        J         B00
                                     END


Since the instructions here form a loop to be performed 51 times, we
consider them more than once. The instructions for pass 1 are:



Addr     Machine code      Mnemonics          I       C
         (octal)           (decimal)

200a     022100            A1     0           19k     20
200b     022263            A2     51          20      21
200c     121100 025511     S1     A,A1        21      32
201a     121200 025602     S2     B,A1        23      34
201c     062312            S3     S1+FS2      34p     40q
201d     131300 025672     C,A1   S3          40      -r
202b     030110            A1     A1+1        42      44s
202c     031012            A0     A1-A2       44      46
202d     011000 001002     JAN    LOOP        48t     53u
(203b    005000            J      B00         50      57)v



Notes for pass 1:

k.  See previous note k.

p.  The issue of the add instruction is delayed until both operands (S1 and
    S2) have arrived from memory. The completion cycle of the S2 fetch is
    the start cycle of the add.

q.  A floating point add requires six cycles to complete (from Table 1).

r.  Normally, we don't need to consider memory. S3 is available to start the
    store at cycle 40, and remains available for other use in the next cycle.

s.  An address add requires two cycles. (So does an A to A move, which is
    really an add of 0.)

t.  A conditional jump instruction does not issue until two cycles after the
    needed operand becomes available. (A0 is returned at 46; 47 is skipped;
    48 is issue.) Other instructions, even one using A0 (but not putting a
    result into A0) could issue at 47, and the jump would still go at 48.

u.  This in-stack branch (to 200c) requires five cycles.

v.  The numbers here refer to the cycles on which this instruction would have
    issued and completed, if the program did not branch back.

The instructions and timing for passes 2 and 51 are as follows:

Address  Machine code      Mnemonics          I       C
         (octal)           (decimal)

Pass 2
200c     121100 025511     S1     A,A1        53u     64
                           (add 32 to Pass 1 numbers)
202d     011000 001002     JAN    LOOP        80      85
(203b    005000            J      B00         82      89)v

Pass 51
                           (add 1600 to Pass 1 numbers)
202d     011000 001002     JAN    LOOP        1648    1653
203b     005000            J      B00         1650w   1657x
252c     120100 000225     S1     M,          1657    1668
253a     072700            S7     RT          1659    1660
253b     120200 000225     S2     M,          1660    1672y
253d     120300 000225     S3     M,          1663y   1676



Notes for Passes 2 through 51:

u.  The in-stack branch completes and this instruction issues during cycle
    53.

v.  Once again, these are the "if it didn't" times.

w.  This time it doesn't.

x.  The return jump requires only seven cycles to complete because the code
    that called this routine is still in a buffer.

y.  Consecutive scalar loads (or stores) may issue as few as 2 cycles apart
    and, if they do not address the same memory bank, finish in 11 additional
    cycles. If the second does address the same bank, it will require one or
    two extra cycles to finish, and a third consecutive scalar load (or
    store) will be delayed from issue until memory is free (at most four
    cycles later).

    In general, a scalar load or store that encounters a memory conflict
    (which could come from I/O) issues as usual. This allows subsequent
    nonmemory instructions to proceed normally, while delaying memory
    instructions until the conflict is resolved. On the other hand, vector
    loads or stores (and instruction-buffer loading) wait until memory is
    entirely free before issuing (or starting). Such delays usually last no
    more than two cycles.



Thus, given the task of writing an efficient scalar loop to compute
C = A+B, we can try a few alternate ways to do it, timing each one as we go,
until we have identified the one with the lowest last-issue cycle.

For example, changing the three lines

        C,A1      S3
        A1        A1+1
        A0        A1-A2

to

        A1        A1+1
        A0        A1-A2
        C-1,A1    S3

would cut six cycles from the loop time and thus result in nearly a 20%
saving in the measured execution time (26 rather than 32 cycles per loop).
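The 32- and 26-cycle figures can be checked with a very small in-order issue model. The sketch below is an assumption-laden simplification, not the CYCLES code: it encodes only the rules given in the notes to Example 1 (a 16-bit instruction lets the next one issue on the following cycle, a 32-bit instruction adds one more cycle, operands must have arrived before issue, and a conditional jump waits two cycles after A0 arrives), and it ignores memory-bank conflicts, result-path conflicts, and instruction-buffer fetches. The names Instr, schedule, loop_body, and cycles_per_pass are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    text: str                       # CAL mnemonic, for display only
    latency: Optional[int]          # cycles from issue to result; None for a store
    dest: Optional[str] = None      # register written, if any
    srcs: Tuple[str, ...] = ()      # registers that must be ready at issue
    parcels: int = 1                # 1 = 16-bit, 2 = 32-bit instruction
    cond_jump: bool = False         # conditional branch testing A0


def schedule(body, start):
    """Return [(text, issue, completion)] for one pass through `body`."""
    ready = {}                      # register -> cycle its value is available
    slot = start                    # earliest cycle the next instruction may issue
    rows = []
    for ins in body:
        issue = slot
        for reg in ins.srcs:
            avail = ready.get(reg, start)
            if ins.cond_jump:
                avail += 2          # jump issues two cycles after A0 arrives
            issue = max(issue, avail)
        done = None if ins.latency is None else issue + ins.latency
        if ins.dest is not None:
            ready[ins.dest] = done
        rows.append((ins.text, issue, done))
        slot = issue + ins.parcels  # 32-bit instructions cost one extra cycle
    return rows


def loop_body(store_last):
    """Build the scalar C = A+B loop, with the store before or after the A adds."""
    load_a = Instr("S1 A,A1", 11, "S1", ("A1",), parcels=2)
    load_b = Instr("S2 B,A1", 11, "S2", ("A1",), parcels=2)
    add = Instr("S3 S1+FS2", 6, "S3", ("S1", "S2"))
    store = Instr("C-1,A1 S3" if store_last else "C,A1 S3",
                  None, None, ("A1", "S3"), parcels=2)
    bump = Instr("A1 A1+1", 2, "A1", ("A1",))
    test = Instr("A0 A1-A2", 2, "A0", ("A1", "A2"))
    jump = Instr("JAN LOOP", 5, None, ("A0",), parcels=2, cond_jump=True)
    middle = [bump, test, store] if store_last else [store, bump, test]
    return [load_a, load_b, add] + middle + [jump]


def cycles_per_pass(body, start=0):
    rows = schedule(body, start)
    return rows[-1][2] - rows[0][1]   # jump completion minus first issue


original = cycles_per_pass(loop_body(store_last=False))
reordered = cycles_per_pass(loop_body(store_last=True))
```

Under these simplified rules the original ordering takes 32 cycles per pass and the reordered one 26. The gain comes from letting the two address adds issue while the floating add is still in progress, so that the jump no longer has to wait for A0.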

While it is actually possible to accomplish this loop by a scalar method
in 14 cycles per pass, the parallel, nonrecursive nature of the loop allows a
much greater saving by using vector instructions. So, let us now consider
code ABCV, and list its timing details. For an alternate view, we use CIVIC
for this compilation.

Example 2 



000000A 
000001 D 
000002C 

0000 1 0D 
00001 ID 
0000120 
00001 3B 
00001 4B 
00001 4A 

000037C 



ABCVF CVF BVF 

KBABCV.BVF) , ML 



LVF P24 L 
=MVF.X=XVF 



S3 



CIVIC 
LDR I 
XV F 

      COMMON /ABCOMMON/ A(56), OUTRANGE, B(56), C(56)
      CALL LINK('UNIT59=TERMINAL//')
      DO 1 I = 1,169
    1 A(I) = I
      OUTRANGE = 600004000000000000000B
      M = Q8RTC(0)
      CALL ABCV
      N = Q8RTC(0)
      X = N-M
      WRITE(59,59) C,X
   59 FORMAT(7F6.0)
      STOP
      END





*      CAL I=ABCV,E= =X00,B=BABCV,L=LVC

                                     IDENT     ABCV
                                     COMMON    ABCOMMON
  71                        A        BSS       57
  70                        B        BSS       56
  70                        C        BSS       56
                                     BLOCK     ABCV
                                     ENTRY     ABCV
       022363               ABCV     A3        51
       0200 00000000C                A0        A
       002003                        VL        A3
       176100                        V1        ,A0,1
       0200 00000071C                A0        B
       176200                        V2        ,A0,1
       171312                        V3        V1+FV2
       0200 00000161C                A0        C
       177030                        ,A0,1     V3
       005000                        J         B00
                                     END



Again, we consider the code from one read RTC to the next. Note that
since this particular set of adds is not more than 64 in length, it can be
done without looping instructions.

We will now record the full five columns of numbers. The I, C, O, F,
and R refer to issue cycle, chain cycle for vector instructions (or
completion cycle for scalars), operand register(s) free cycle, functional
unit free cycle, and result register free cycle, respectively.
          Machine code                                (decimal)
Address   (octal)          Mnemonics         I      C      O      F      R

5013d     072300           S3     RT         0      1
5014a     130300 005053    M,     S3         1      5
5014c     022700           A7     0          3      4
5014d     007000 024000    R      ABCV       4      9e
5000a     022363           A3     51         9      10
5000b     020000 000200    A0     A          10     11
5000d     002003           VL     A3         12     13
5001a     176100           V1     ,A0,1      13f    22g    -h     68i    73j
5001b     020000 000271    A0     B          14k    15
5001d     176200           V2     ,A0,1      68l    77     -      123    128
5002a     171312           V3     V1+FV2     77m    85n    128o   132n   136n
5002b     020000 000361    A0     C          78     79
5002d     177030           ,A0,1  V3         136p   -q     187r   192s   -
5003a     005000           J      B00        137    144
5015b     072100           S1     RT         144t   145
5015c     130100 005054    N,     S1         192u   -

Notes

(a, b, c, and d are parcel addresses, after the load, as before.)

e. For this compilation the destination of the return jump is already loaded
into a buffer, so the branch instruction executes in only five cycles.

f. To begin execution, this vector instruction needs A0 and VL to be ready,
V1 to be free, and memory to be free. Since they are, it issues.

g. The first result will arrive from memory nine cycles after the issue
cycle. This cycle (cycle 22) is the chain cycle for this memory load.
(More on chaining in note m.)

h. When this instruction issues (cycle 13) it transmits as operands the
contents of the VL register, the special value 1, and register A0 to the
memory functional unit. (Some vector memory loads use a second
A-register for the increment.) All these scalar transmissions occur
during the issue cycle and are held by the functional unit thereafter.
[When A0 and S0 are used as special values their reservation is not
checked, and so they do not delay issue. Here, however, A0 is also used
to hold an address, and if it had not been free when needed, the issue
would be delayed.] For a vector load instruction, no vector register is
used as input, so no entry is made in column O.



i. For this instruction, the functional unit involved is memory. As with
scalar memory references, a memory bank will be busy for four cycles with
each word read. If the vector load moves through at least three other
banks before returning to a previous one (as is the case here), no
conflicts will arise, and a new word will be read each cycle. The first
word is requested at cycle 13 and the 51st at cycle 63. The memory will
be busy for 4 more cycles, through cycle 67, and free for another memory
reference in the next cycle. We record 68 = 13+51+4 under the functional
unit free column. Notice that memory is free five cycles before register
V1 is ready.
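
The bank condition in the note above can be checked with a few lines of arithmetic. The following is only a sketch: the bank count of 16 is an assumption (it is not stated in this section), the four-cycle bank busy time is taken from the note, and the function names are hypothetical.

```python
from math import gcd

NUM_BANKS = 16         # assumption: bank count is not given in this section
BANK_BUSY_CYCLES = 4   # a bank is busy four cycles per word read (see note)

def banks_visited(stride, num_banks=NUM_BANKS):
    """Number of distinct banks touched before the access pattern repeats."""
    return num_banks // gcd(abs(stride), num_banks)

def conflict_free(stride, num_banks=NUM_BANKS):
    """True if at least three other banks are visited before a bank recurs,
    so that one word can be delivered every cycle."""
    return banks_visited(stride, num_banks) >= BANK_BUSY_CYCLES

for stride in (1, 2, 4, 8, 16):
    print(stride, banks_visited(stride), conflict_free(stride))
```

Under these assumptions a stride of 1 (as in ABCV) is conflict free, while strides that are multiples of 8 return to the same bank too soon.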

j. When this instruction issues (cycle 13), it puts a hold, or reserve, on
register V1 in order to keep it available for the words coming in from
memory. The reserve will be lifted after the last word arrives. Since
(VL) is 51, the last (51st) word will arrive in cycle 72. (The first
arrives in cycle 22.) In the next cycle the V1 register may be used for
another purpose; therefore we record 73 = 22+51 under the result register
free column. The CRAY hardware has one element pointer for each
V-register, and it is used to select one of the 64 positions in the
V-register. The pointer for register V1 is automatically stepped from 1
through 51 during cycles 22 through 72.

k. Since the previous vector instruction read out A0 and (VL), saving them
in the functional unit at the start of the vector load, subsequent
instructions may modify them immediately without affecting the previous
instruction.

l. Here a major delay is encountered. This instruction also transmits words
from memory to a V-register. The register is available but the memory is
busy, so issue is delayed till it is free (in cycle 68).

m. This instruction chains. At cycle 69, it is first considered for issue.
However, before it can begin executing, this vector add needs to have the
vector length register, register V1, register V2, the floating point add
functional unit, and register V3 free. V1, as noted, becomes free at
cycle 73; V2 will not be free until 128; but the first element will
arrive at cycle 77 and during that one cycle, it can be redirected, or
chained, to serve as input to the add unit as well as being put into V2.
The conditions for chaining are thus satisfied during cycle 77, and so
the instruction issues.

n. The first result exits from the floating point adder eight cycles after
the first operands were sent over. For this instruction, then, its chain
cycle is 85 = 77+8. Similarly its result register (V3) free cycle is
136 = 85+51, and its functional unit free cycle is 132 = 77+51+4. The
four extra cycles here are equivalent to the four extra cycles needed for
memory free by the memory functional unit. All functional units remain
reserved for four extra cycles after the last element arrives during
vector instructions. This means that a subsequent scalar (or vector)
floating point add cannot issue until cycle 132, since it shares this
unit.

o. Since this instruction requires that vector register operands be sent to
the adder for the next 51 cycles, a reserve is placed on registers V1 and
V2 until cycle 128, at which time they will both be free and able to be
used by a subsequent operation.

p. This vector store does not chain from the add. In the first place, at
cycle 85, the chain cycle for V3, the memory is busy completing the load
of V2. In the second place, store instructions are barred by the
hardware from chaining even if the memory functional unit is free. The
store doesn't begin at cycle 123 (when the memory becomes free) either.
It can't issue at 123 because the element pointer for V3 is not pointing
to V3's first element, which the store needs, but rather at element 39,
which is being returned by the floating-point adder. It finally issues
when register V3 is not otherwise busy and can have its element pointer
reset, namely cycle 136, the result register free cycle for the earlier
add.

q. A store doesn't chain to anything, either. 

r. Register V3 will be free after the store at cycle 187 = 136+51.

s. Finally, the memory functional unit will become free from the store five
cycles after the operand register, V3, is free. All other instructions
free their functional units four cycles after their operand registers, but
store requires one extra cycle.

t. Since the return from subroutine did not require memory, as the address
is already in a buffer, the next instruction, which for CIVIC is the read
of the RTC, gets issued well before the vector store completes.

u. Finally, we note that the final store of the RTC value to memory is
delayed by the memory busy condition from the vector store, and issues
when the memory functional unit ready cycle occurs.
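
The arithmetic in notes g through s can be collected into a small model. The sketch below is illustrative only and is not part of CYCLES: the function names are hypothetical, the latencies (9 cycles for a memory load, 8 for the floating point adder) and the +4 and +5 functional unit offsets are the ones quoted in the notes, and the issue cycles are supplied by hand from the reasoning in notes f, l, m, and p.

```python
MEMORY_LOAD_LATENCY = 9   # issue to first word arriving (note g)
FP_ADD_LATENCY = 8        # operands sent to first result (note n)

def vector_load(issue, vl):
    """Timing columns for a vector load: no vector operand, so O is empty."""
    chain = issue + MEMORY_LOAD_LATENCY
    return {"I": issue, "C": chain, "O": None,
            "F": issue + vl + 4, "R": chain + vl}

def vector_arith(issue, vl, latency):
    """Timing columns for a register-to-register vector operation."""
    chain = issue + latency
    return {"I": issue, "C": chain, "O": issue + vl,
            "F": issue + vl + 4, "R": chain + vl}

def vector_store(issue, vl):
    """Timing columns for a vector store: no chaining, memory frees at O+5."""
    operand_free = issue + vl
    return {"I": issue, "C": None, "O": operand_free,
            "F": operand_free + 5, "R": None}

VL = 51
load_a = vector_load(13, VL)                          # V1 ,A0,1
load_b = vector_load(load_a["F"], VL)                 # waits for memory (note l)
add = vector_arith(load_b["C"], VL, FP_ADD_LATENCY)   # chains on V2 (note m)
store = vector_store(add["R"], VL)                    # waits for V3 (note p)

for name, row in (("V1 load", load_a), ("V2 load", load_b),
                  ("V3 add", add), ("store", store)):
    print(name, row)
```

Run as written, this reproduces the I, C, O, F, and R entries recorded in the chart for the four vector instructions of ABCV.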

Conclusions

It should be clear from the timing chart above that the CRAY-1 is not
really very busy during this vector add routine. For example at cycle 78,
its busiest cycle, V-registers 0, 4, 5, 6, and 7 are free along with the shift,
fixed add, multiply, reciprocal, and logical functional units. Moreover, the
next 55 cycles (as well as most of the previous 60) could be used to issue
independent instructions for a related calculation, if one needed to be done.
(In fact, we can actually decrease the time for ABCV by four cycles by using
some of the idle resources.)

Frequently, parallel use of available resources can be made, especially
in the case of vector loops. Three examples of actual code are presented in
Section V to show this: ZVSEEK, QVDIVO, and QVSQRTH.
IV. THE COMPUTER CODE CYCLES

CYCLES is a public file on the CRAY-1 computers at LLNL. It was written
by Rollin Harding. A Fortran version of it has been made available to Cray
Research Incorporated and is being modified for use under their system.

CYCLES is not a simulator and does not have knowledge of the values in
all machine registers. It does, however, try to keep track of the values in
the VL and A registers. Options allow these register values to be specified
for CYCLES' use.

The rest of this section is taken from the documentation for CYCLES. A
full writeup, CYCLEWUP, can be extracted from the CYCLES public file using
LIB.

Cycles Writeup

CYCLES was designed for detailed analysis of instruction scheduling in
compiled or assembled CRAY codes. The timing analysis is presented in the
spirit of Harry Nelson's report, UCID-30179, Rev. 1, "Timing Codes on the
CRAY-1". Harry supplied additional timing details and tested the code
extensively during the debugging period.

Input to CYCLES is any HSP file from CAL, CIVIC, CFT, or DDT which
contains the machine code listing. CYCLES accepts single or double column
listings from CIVIC (M or L option) and the four instructions per line format
from CFT (on=g). Sequences of octal parcels may be entered from TTY or by
specifying octal word limits in a controllee or other binary file. In TTY or
binary modes CYCLES adds the equivalent CRAY assembly language instructions
to the output, i.e., does a CRAY UNDO. CYCLES will also accept the history
file produced by DDT in the l"IN£ output format mode. This form has the
advantage of using correct symbols for variables in the program being undone.

H 

Output consists of a copy of the input file with up to seven columns of
timing information added for each machine instruction line. (This overwrites
the comment field in CAL listings.) The NOCOPY. option will suppress most
non-instruction lines from being output. The seven timing columns are:

     W   number of cycles this instruction waited to issue
     D   octal codes identifying any delays
     I   issue cycle for the current instruction
     C   vector chain cycle or scalar completion cycle
     O   vector operand register ready time
     F   vector functional unit ready time
     R   vector result ready time

The I column is always given; others are suppressed if null or
irrelevant for the current instruction. Alternate definitions for columns C,
O, F, and R for jump instructions are given below.

CYCLES is very fast and is easily run as a controllee under TRIXGL.
Output can be viewed without line wraparound by using TUBE command S or
TRIXGL command Tv,1 for small characters. Effects of altering instruction
sequences can be checked easily by rearranging lines in CYCLES' infile and
rerunning it without reassembling your code. One may also rearrange lines in
CYCLES' outfile and then use that as the infile. CYCLES' CIVIC output is
compatible with single column CIVIC output. CYCLES' CFT, CAL, and binary
output are compatible with CAL output. CYCLES' DDT output is compatible with
DDT output.

Abilities and limitations

CYCLES is aware of most of the fine points of CRAY instruction
scheduling:

     - chaining requirements
     - recursive vector operations
     - no waits for special A0 and S0 operands
     - memory functional unit requirements
     - vector memory conflicts due to 8*n increments
     - A and S register trunk conflicts
     - extra delay after A0 or S0 ready for conditional jumps
     - scalar memory bank conflicts (with limitations)
     - instruction buffer fetches, conflicts, and delays
     - other special cases

CYCLES has to make assumptions about loader dependent conditions such as
instruction buffer delays and scalar memory bank conflicts. Bank conflicts
may not be detected if memory addresses are indefinite. Addresses are
indefinite if they involve undefined A register values or unspecified
relocation flags. Options are provided to specify that the current code
block (local relocation) is loaded on a 20b-word buffer boundary or that all
external blocks (subroutines or commons) are loaded on 20b-word boundaries.
The relevant option names are +., x., and +x. to set relocation flags, and
MBOFF. to turn off bank conflict checking. IBOFF. turns off instruction
buffer checking.

VL and A registers

Many instruction timings depend on values of the vector length register
and A registers. CYCLES attempts to keep VL and A regs current as
instructions are processed that set those registers. A registers set from
memory or from S registers are considered indefinite. Results of A register
calculations involving indefinites are also indefinite. VL will be set to 64
if it is set from an indefinite A register. Register changes are reported in
the output. Automatic register setting can be disabled by the NOVLA.
execute line option.

You may explicitly reset values for VL, A, or NI (next issue) by
inserting control lines into CYCLES' input file or as comments in a CAL
source file. In column 1 of the input file use Ln to set VL to n (decimal),
use Cn to reset counters and force the next issue to cycle n (decimal), and
use An,m to set register An to m (decimal). CAL comments *Ln, *Cn, and *An,m
would have the same effects.
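
The register-tracking rules just described can be sketched as follows. This is not the CYCLES source; the class and method names are hypothetical, `None` is used here to stand for an indefinite value, and the initial VL of 64 is an assumption made for the sketch.

```python
INDEFINITE = None   # marker for a register whose value cannot be known

class RegisterTracker:
    """Tracks VL and A0-A7 the way the writeup describes (illustrative only)."""

    def __init__(self):
        self.a = [INDEFINITE] * 8   # all A registers start out unknown
        self.vl = 64                # assumption: worst-case vector length

    def set_a_constant(self, i, value):
        # e.g. "A3 51": the value is known from the instruction itself
        self.a[i] = value

    def set_a_from_memory_or_s(self, i):
        # A registers set from memory or from S registers are indefinite
        self.a[i] = INDEFINITE

    def add_a(self, i, j, k):
        # results of calculations involving indefinites are indefinite
        x, y = self.a[j], self.a[k]
        known = x is not INDEFINITE and y is not INDEFINITE
        self.a[i] = x + y if known else INDEFINITE

    def set_vl_from_a(self, k):
        # VL is set to 64 if it is set from an indefinite A register
        value = self.a[k]
        self.vl = 64 if value is INDEFINITE else value

regs = RegisterTracker()
regs.set_a_constant(3, 51)
regs.set_vl_from_a(3)            # VL is now 51, as in ABCV
regs.set_a_from_memory_or_s(4)
regs.add_a(5, 3, 4)              # A5 becomes indefinite
```

An Ln or An,m control line would correspond to overwriting `vl` or an entry of `a` directly.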

Jump instructions

For conditional jumps, CYCLES assumes drop through timing. Normally,
the cycle counter is reset to zero after each unconditional jump. However,
if the following instruction is recognized (by its address) as the target
instruction, then timing continues without reset. This can be accomplished
by control cards (CYCLE OFF/IN/OUT or REPEATn, described below) or by
rearranging the input file.

For a jump instruction certain columns are redefined:

     C   Earliest issue for the jump target if the jump is taken
     O   Target instruction buffer code (see I-buff section)
     F   Target issue time for an in-buffer jump
     R   Target issue time for an out-of-buffer jump

An out-of-buffer jump can be significantly delayed if memory is busy,
for instance, completing a vector store.

You can control the output for a jump to a later instruction by
inserting a control line CYCLE OFF immediately after the jump and a CYCLE IN
or CYCLE OUT line immediately before the target instruction. CYCLES will
stop timing after the OFF and will resume by issuing the target instruction
at the proper IN buffer or OUT of buffer issue time. Comments, *CYCLE OFF,
etc., can be used in a CAL source as well.

A REPEATn line can be used for continuous timing over a jump to an
earlier instruction. The REPEAT line is inserted immediately before the
target instruction. From then on, each jump instruction is checked to see if
its target has an active repeat line. If it does, the count n is
decremented, and timing continues at the target line using the in buffer time
plus any appropriate delays for registers or functional units. Up to ten
repeat lines may be active at any time. Repeats may be nested.

Instruction buffer (I-buff) delays

The CRAY has 4 instruction buffers. They are loaded in rotation. Each
holds 20b words (64 parcels) of instructions. I-buff delays occur each time
execution shifts from one buffer to another due to a jump instruction or
simply when crossing from one 20b block to the next. Additional delays
result when memory operations conflict with instruction fetches or when a
two-parcel instruction straddles a buffer boundary. For I-buff checking
CYCLES assumes that relative word zero is loaded on a 20b-word boundary.
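
The boundary conditions described above reduce to simple parcel arithmetic. The sketch below assumes, as CYCLES does, that relative word zero sits on a 20b-word boundary, and uses four parcels (a, b, c, d) per word; the function names are hypothetical.

```python
PARCELS_PER_WORD = 4            # parcels a, b, c, d
WORDS_PER_BUFFER = 0o20         # 20b words per instruction buffer
PARCELS_PER_BUFFER = PARCELS_PER_WORD * WORDS_PER_BUFFER   # 64 parcels

def parcel_index(word, parcel):
    """Linear parcel number of a (relative word, parcel letter) address."""
    return word * PARCELS_PER_WORD + "abcd".index(parcel)

def buffer_block(word, parcel):
    """Which 20b-word block of the code holds this parcel."""
    return parcel_index(word, parcel) // PARCELS_PER_BUFFER

def straddles_boundary(word, parcel, parcels=1):
    """True if an instruction of the given length crosses a 20b boundary."""
    first = parcel_index(word, parcel)
    last = first + parcels - 1
    return first // PARCELS_PER_BUFFER != last // PARCELS_PER_BUFFER

# A two-parcel instruction starting at 17d spills into word 20b.
print(straddles_boundary(0o17, "d", parcels=2))
```

This is the situation behind delay code 40000b below: a two-parcel instruction at parcel 17d crosses into the next buffer.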

I-buff delays are indicated in the usual way, using delay code 200b, but
additional information is also given:

- The first instruction from a buffer is marked (between the W and D columns)
  by a letter a, b, c, or d for buffer 0, 1, 2, or 3. Upper case means the
  instructions were fetched from memory; lower case means the buffer was
  already loaded.

- For jump instructions the target instruction buffer is given under the O
  column. Again, upper case is out-of-buffer; lower case is in-buffer. A
  jump to an external (x reloc) address is always considered out-of-buffer.
  An unconditional jump out-of-buffer clears one instruction buffer unless
  the NOICLR. option is used. A Bn line can be used to clear n additional
  instruction buffers.

- Delay code 10000b shows that an instruction fetch was delayed because
  memory was busy. Because of look-ahead, this does not cause an immediate
  delay of issue, but it does signal a possible delay for a subsequent issue
  (usually the target instruction of an out-of-buffer jump appearing in
  column F).

- Delay code 20000b indicates that the parcel address for the current
  instruction was not in a current I-buff or one that had been fetched. No
  delay is assessed.

- Delay code 40000b indicates the possibility of a delay that this version of
  CYCLES couldn't determine. The marked instruction is parcel 17c of the
  current instruction buffer. If the next instruction (17d) happens to be a
  two-parcel instruction (this is what the timing subroutine didn't know)
  then 17c would be delayed until one cycle before the issue time indicated
  on the next line for 17d. This delay of parcel 17c could cause further
  delays not shown for 17d, 20b, or later instructions. Correct timing can
  be produced in the current version by inserting an "In" control card before
  17c, where n (decimal) is the correct issue time for 17c.

Availability of CYCLES

The latest version of CYCLES is maintained in CRAY public file CYCLES.
The HELP packages are reproduced below. The output file is named Hinfile and
is left on disk. An existing file will be overwritten. If the file
overflows, sequence numbers will be added: 00, etc.

This writeup is available as CYCLEWUP in public file CYCLES. It will be
revised as suggestions are made or changes made to CYCLES. The revision date
is given on line 1.

Please send suggestions for enhancements to CYCLES or listings of any
bugs you encounter to Rollin Harding in A-Division (L-16).

CYCLES HELP:

execute lines:
   cycles hspfile type <nocopy. novla. ... noiclr. &> / t v
   cycles tty / t v
   cycles binfile fwa lwa <abs. end> / t v     (binary input mode)
type is cal cft civic or ddt
<> shows options, keep in order, no comma for dropouts.
   nocopy.        suppresses non-instruction lines
   novla.         defeats automatic setting of vl and a registers
   mboff.         suppresses mem bank conflict checking
   +x.            assumes both +. and x. (increases mem bank checking)
   +.             assumes present routine is loaded on a 20b boundary
   x.             assumes externals are loaded on 20b-word boundaries
   +xreloc. oct   sets both +reloc. and xreloc. (affects i-buff chks)
   +reloc. oct    =>offset=oct for local word in i-buff and mem bank
   xreloc. oct    =>offset=oct for external reloc vars and subrs
   iboff.         suppresses instruction buffer checking
   noiclr.        suppress clearing an i-buff after out-buf uncond jmp
   &              to continue execute line
fwa, lwa are octal; may have a, b, ; pa, pb, etc. parcel tags
   abs.           changes assumed 3400b minus word offset to
   end            says don't ask for additional fwa lwa pairs
outfile name will be h + infile name
type delaycd for list of delay codes
type helpcc for list of infile control card options



HELPCC:

in col 1 of cycles' input file (cal, civic, cft, ddt) use:
   ln          to set vector length to n (decimal)
   cn          to reset registers and set next issue time to n (decimal)
   bn          to clear n additional instruction buffers
   in          to set next issue to n (dec) without resetting registers
   am,n        to set register am to value n (decimal)
   repeat n    before target instr to time n jumps back to target
   cycle off   disable cycle counting, use after conditional jump
   cycle on    resume counting at the 'in buffer' jump time
   cycle in    same as cycle on
   cycle out   resume counting at the 'out of buffer' jump time
use any of these as comments in your cal infile: *am,n etc.
in TTY mode use ln cn in an,m as above, and use
   ploci       to set parcel to word 'loc' and parcel i=a,b,c,d,pa,
Table of Delay Codes

DELAYCD:

octal delay codes:
       1b   functional unit not ready
       2b   result register not ready
       4b   operand register not ready
      10b   waiting for chain cycle
      20b   a or s register trunk conflict
      40b   scalar memory operation bank conflict
     100b   conditional jump delayed by a0 or s0 busy last 2 cycles
     200b   instruction buffer delay
     400b   operand chain cycles don't match, can't chain
    1000b   missed chain cycle
    2000b   waits for all instructions to complete
    4000b   waiting for register block transfer to finish
   10000b   instruction fetch delayed by memory busy
   20000b   current instr in unexpected buffer, no delay added
   40000b   possible two parcel split delay of 17c
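
Since every code in the table is a distinct power of two, a D-column entry can be read as a set of flags. The sketch below assumes that simultaneous delays are reported as the bitwise OR of the individual codes; the writeup does not say so explicitly, and the function name is hypothetical.

```python
DELAY_CODES = {
    0o1: "functional unit not ready",
    0o2: "result register not ready",
    0o4: "operand register not ready",
    0o10: "waiting for chain cycle",
    0o20: "a or s register trunk conflict",
    0o40: "scalar memory operation bank conflict",
    0o100: "conditional jump delayed by a0 or s0 busy last 2 cycles",
    0o200: "instruction buffer delay",
    0o400: "operand chain cycles don't match, can't chain",
    0o1000: "missed chain cycle",
    0o2000: "waits for all instructions to complete",
    0o4000: "waiting for register block transfer to finish",
    0o10000: "instruction fetch delayed by memory busy",
    0o20000: "current instr in unexpected buffer, no delay added",
    0o40000: "possible two parcel split delay of 17c",
}

def decode_delay(code_text):
    """Split an octal D-column entry such as '12b' into its component delays.

    Assumes (not confirmed by the writeup) that codes combine by bitwise OR.
    """
    value = int(code_text.rstrip("bB"), 8)
    return [text for bit, text in sorted(DELAY_CODES.items()) if value & bit]

print(decode_delay("12b"))
```

Under that assumption, an entry of 12b would mean both "result register not ready" and "waiting for chain cycle".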
EXAMPLES

As mentioned above, source code for all these examples is in public LIB
file CLASS on the CRAY-1. The timing numbers are from code CYCLES.

ZVSEEK

ZVSEEK is a BASELIB function designed to find a target value in an
unordered list. The original version was written about a year before the
LLNL machine arrived and has since been upgraded by use of timing analysis to
run more than twice as fast. Most of the speed increase was obtained through
a simple algorithm change: replacement of a logical vector instruction by a
fixed add. However, an additional healthy gain came through improved
handling of the vector looping technique. The main loop of the original
routine consists of 10 instructions.

Main Loop of ZVSEEK (Old Version)

This version prestores the target at the end of the search array, so
that it must eventually exit on a hit.

Timing of original version: VL = 64.

Address  Instruction      I      C      O      F      R     Comment

L64      V0   ,A0,1       0      9      -      68     73    Get next 64 values
         V1   S4\V0       9      13     73     77     77    XOR each with target
         VM   V1,Z        77a    -b     141    145    147c  Check for hit
         S1   VM          147c   148                        VM to S for count
         S0   VM          148    149                        VM to S for test
         A4   ZS1         149    152                        Count left zeroes
                                                            (needed if hit)
         JSN  HIT         151d   156                        Exit if hit
         A0   A5+A6       153    155                        Loc. of next 64 values
         A5   A5+A6       154    156                        Up A5 by 64
         J    L64         155    160e                       Go check next 64 values

Notes:

a. Since the VM is set by the logical functional unit, this instruction,
which also uses the logical unit, delays until the unit is free and does
not chain.

b. The vector mask instruction never chains its output to anything.

c. While another logical vector operation using the VM-register could start
at cycle 145 (for example, merge), the VM cannot be read out to an
S-register until two cycles later (see the CRAY-1 Hardware Manual,
p. 4-69 or page 122 of the online version, LCSD-158). Thus, we record
147 as the register free cycle.

d. This instruction is delayed one cycle since S0 has not been ready for the
necessary unused cycle.

e. As written, this loop is taking 160 cycles for each 64 elements searched.
Improved Version with XOR Replaced by Fixed Subtract

Address  Instruction      I      C      O      F      R     Comments

L64      V0   ,A0,1       0      9      -      68     73
         A0   A5+A6       1      3                          No reason to wait
         V1   S4-V0       9      14     73     77     78    Subtract each from
                                                            target
         A5   A5+A6       10     12                         Get it out of the way
         VM   V1,Z        14f    -      78     82     84
         S0   VM          84g    85
         S1   VM          85     86
         A4   ZS1         86     89
         JSN  HIT         87     92
         J    L64         89     94h

Notes:

f. Since the fixed subtract was used in place of the logical difference, the
vector mask instruction can now chain its input operands.

g. Exchanging the order of the VM transmits to S saves a cycle later on.

h. The loop is now performing the same service as before but using only 94
cycles for each 64 elements searched.

This latter loop represents approximately a 40% improvement over the
former. However, because (1) no functional unit is used for more than 68
cycles, (2) no register is used for more than 73 cycles, and (3) there are
plenty of unused registers, one would expect that additional savings may be
possible.

Another item that should be taken into consideration is that this method
is rather inefficient for those searches in which the target value is found
in the first portion of a set of 64 elements searched. For example, suppose
the list we are searching has 64 entries. On the average, we would expect to
find the target value in the first half of the list as often as in the last
half, but for all these cases, the loop as written will require the full list
to be tested.

In fact, there is a clever (almost heroic) method available which can go
through this particular search loop in exactly 68 cycles per 64 elements
searched. The treatment below, however, is somewhat easier to code (and
debug) and offers an improvement in the time used to find the target over
even the heroic method, on the average, for searches up to 512 in length.

The main tricks employed are: (1) breaking the array into vectors of
length 32 each; (2) replicating the loop but using a different set of
V-registers for each half; (3) loading and subtracting a second set of 32
elements while waiting for the VM instruction for the first 32 to finish; and
(4) loading extra unneeded elements in the first half of the loop and using
an otherwise unneeded vector operation in the second half to maintain the
correct timing so that the load-subtract-VM chain will not be broken.

The timing chart for the main loop is given below. The notes following
are referenced by line number.

Line  Address  Instruction      I      C      O      F      R

  1            First half of main loop
               A2    35
               A5    ADDRESS
               S4    TARGET,
  7   L64      A0    A5         0      2
  8            VL    A2         1      2
  9            V0    ,A0,1      2      11     -      41     46
 10            A6    32         3      4
 11            S6    A6         4      6
 12            VL    A6         5      6
 13            A6    A4         6      8
 14            V1    S4-V0      11     16     43     47     48
 15            S1    VM         15     16
 16            VM    V1,Z       16     -      48     52     54
 17            S0    S1         17     18
 18            A4    ZS1        18     21
 19            JSN   HIT        20     25
 20            S0    S6-S3      22     25
 21            A6    32         23     24
 22            A5    A5+A6      24     26
 23            S3    S3+S6      25     28
 24            JSP   DUN        27     32

               Second half of main loop
 28            A0    A5         29     32
 29            V2    ,A0,1      41     50     -      77     82
 30            V3    S4-V2      50     55     82     86     87
 31            A4    15         51     52
 32            S1    VM         54     55
 33            VM    V3,Z       55     -      87     91     93
 34            S0    S1         56     57
 35            VL    A4         57     58
 36            A4    ZS1        58     61
 37            V0    V6<A0      59     65     74     78     80
 38            JSN   HIT        60     65
 39            S0    S6-S3      62     65
 40            A5    A5+A6      63     65
 41            S3    S3-S6      64     67
 42            A4    32         65     66
 43            JSM   L64        67     72

      DUN      A5    A5-A6      69     71

Notes:

Line 8. Although we are only going to check 32 elements, we take care to
load 35. The reason for this will appear at line 33.

Line 9. Since there are 35 elements being loaded, F = 2+35+4.

Line 12. Now we cut the VL back to 32. Reducing the vector length in the
middle of a chain is perfectly safe. However, increasing it while
chaining can lead to wrong answers (i.e., the answers may differ
depending on external happenings such as I/O activity, system
interrupts, and operands out of range).

Line 16. The chain continues, with the functional unit becoming free at
cycle 52, while the VM itself is not transmittable to S1 until 54.

Line 29. When we reach here we are simply waiting for the previous vector
mask instruction at line 16 to finish. Since the memory functional
unit is free, we may as well start to load the next 32 elements.
We choose not to load 35 elements this time.

Line 30. The fixed adder is also free so we may as well start the next
subtract at chain time.

Line 32. We must rescue the previous VM register setting before we can form
a new one. Cycle 54 is the earliest this can be done.

Line 33. The cycle following the move of the VM to S1 is the first cycle in
which we can start a new VM instruction. Happily, cycle 55 is also
the chain cycle for the subtract at line 30, so the chaining
continues. Notice what would have happened if we had loaded only
32 elements at line 9. First, for that instruction, the functional
unit would have gone free at cycle 38. Second, the load at line 29
would have then begun at cycle 38. Third, the subtract at line 30
would have chained at cycle 47. Finally, the VM at line 33 would
have missed the chain cycle (52), since we had to hold it up for
the move of the old VM to S1. Thus, it would not have issued till
cycle 87.

But since the loop will normally continue back to the VM at line
16, and since we have not loaded 35 elements this time, we must do
something to hold back the load at line 9 in the next pass, or the
VM at line 16 will again miss its chain cycle.

Line 35. Here we start to pull another trick, which will delay the load at
line 9 in the next pass and at the same time protect this loop
against a problem (in timing, not correctness) that may arise if
there is an interrupt during its execution. The protection is free
in terms of the cycles required to do it, but it does require extra
instructions.

Line 37. This is the protection instruction. Since it is putting 15 results
into V0 using the shift functional unit, which has a chain time of
6, it will tie up register V0 until cycle 80. This in turn will
cause the next load at line 9, which uses V0, to be held until
cycle 80. This is the exact cycle desired, since it will bring the
chain cycle from the subtract at line 14 to cycle 94, the cycle
immediately after the one in which we can first save the VM (93).
At the same time, regardless of whether or not some interrupt has
come along and bollixed our careful timing, this will force the
next load (at line 9) to hold long enough relative to the previous
VM so that we will be back in synch thereafter.

Line 38. In this program address HIT has already been put into an
instruction buffer. If this were not the case, the jump would
complete at cycle 91.

Lines 37 through 40.
Several instructions are completing in cycle 65; each uses a
different register set.

Line 43. After jumping back, we will be holding at line 9 for the completion 
of the instruction at line 37. The loop time will be 78 cycles for
each 64 elements tested, but, on the average, we will exit in the
upper half of the loop half the time, which provides a further 
speed increase, especially valuable for short arrays. 
QVDIV0

As another example, we present the coding for QVDIV0, the CRAY-1
STACKLIB divide routine.

On the CRAY, the vector divide algorithm used to accomplish the FORTRAN
vector statement C = A/B, where A, B, and C are vectors with arbitrary
(linear) stride, requires three vector memory operations, three vector
multiply operations, and one vector reciprocal approximation instruction for
each 64 elements. The current CFT implementation of the general vector
divide loop requires 443 cycles per 64 elements stored plus some startup
time, which brings the cost for such a divide to roughly 7 cycles per
element. However, by overlaying the storing of the result for the first pass
through the loop and the loading of the operands for the third pass through
the loop with the multiplying still being carried out for the second pass,
one can expect to achieve something on the order of twice CFT's performance.
In fact, the theoretical minimum, 205 cycles (68 + 68 for loads + 69 for
store) per 64 elements (after suitable startup time) is achieved in this
routine. The timing chart for the main loop is given with notes below.
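The operation count quoted above (one reciprocal approximation and three multiplies per result) is what a divide by reciprocal refinement needs. The Python sketch below illustrates that idea on scalars; it is not the STACKLIB code. The helper `approx_reciprocal`, its 30-bit accuracy, and the assignment of the three multiplies to the steps shown are assumptions made for the example.

```python
import math

def approx_reciprocal(b, bits=30):
    """Hypothetical stand-in for a reduced-precision reciprocal:
    1/b with its mantissa truncated to `bits` bits (assumed accuracy)."""
    mantissa, exponent = math.frexp(1.0 / b)
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)

def divide_by_reciprocal(a, b):
    r = approx_reciprocal(b)     # reciprocal approximation
    correction = 2.0 - b * r     # multiply 1: one refinement step, 2 - b*r
    r_refined = r * correction   # multiply 2: refined reciprocal
    return a * r_refined         # multiply 3: quotient a * (1/b)

def strided_divide(a, b, n, stride_c=1, stride_a=1, stride_b=1):
    """C = A/B for n elements, each vector with its own linear stride."""
    c = [0.0] * ((n - 1) * stride_c + 1)
    for i in range(n):
        c[i * stride_c] = divide_by_reciprocal(a[i * stride_a], b[i * stride_b])
    return c
```

One refinement step roughly doubles the number of correct bits, which is why a single reduced-precision reciprocal followed by multiplies can replace a hardware divide.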

Line      Instruction         I     C           F     R

 -3       V6   V2*IV1       -137  -128   -73   -69   -64
 -2       V4   V1*FV6        -64   -55     0     4     9
 -1       A0   S5            -63   -62
  0       JSP  TWOTRIP       -62   -48
  *       BUFFER BOUNDARY
  1       V2   ,A0,A5        -48   -39    16    20    25
  2  LP   VL   A4            -44   -43
  3       A3   A5*A7         -43   -37
  4       S3   A2            -42   -40
  5       V1   S0+V5           0     5    64    68    69
  6       S3   S3<6            1     3
  7       S2   S3+S2           3     6
  8       V6   V7*IV1          5    14    69    73    78
  9       S3   A3              6     8
 10       S5   S5+S3           8    11
 11       A0   S5             11    12
 12       V0   ,A0,A5         20    29     -    88    93
 13       VL   A7             21    22
 14       V3   V4*RV2         73    82   137   141   146
 15       VL   A4             74    75
 16       A0   S2             75    76
 17       V7   ,A0,A2         88    97     -   156   161
 18       V5   /HV7           97   113   161   165   177
 19       V2   V0&V0         137   141   201   205   205
 20       V4   V1*FV6        141   150   205   209   214
 21       VL   A7            142   143
 22       A0   S6            143   144
 23       A3   A6*A7         144   150
 24       ,A0,A6  V3         156         220   225
 25       S3   A3            157   159
 26       S0   S1-S4         158   161
 27       S1   S1-S4         159   162
 28       A7   A4            160   162
 29       S7   S4            162   163
 30       S6   S6+S3         163   169
 31       JSN  LP            164   169



Notes:

Line -3. We choose to begin the timing chart somewhat before the loop. We
have to start the timing somewhere. Arbitrarily, we may take the
start of this instruction as any cycle. Cycle -137 will be
convenient.

Line -2. At this point, it is clear that the state of the machine prior to
line -3 will have no effect on the issue time of this instruction.
(Actually, a vector reciprocal instruction whose result register
was V4 could still be in progress and would delay this issue by a
few cycles, but that is not the case.)

Lines 0 and 1.
The jump here to TWOTRIP is not taken. However, a 16-word buffer
boundary (20 octal) occurs after the JSP instruction, and this
delays the next instruction until the new buffer can be loaded from
memory. Notice that the time of issue of the instruction at line 1
after the buffer load is the same as it would have been had a jump
been taken to it.

Line 5. This move instruction is the first vector instruction in the loop.
We have arranged to make it issue at cycle 0. It will wait to
issue until V1 has delivered all the operands for the multiply
instruction at line -2.

Line 8. This multiply chains with the fixed add (move) at line 5. We have
insured chaining by delaying the move long enough to have the
multiply functional unit free from line -2.

Line 12. This load will issue as soon as the previous one at line 1 releases
the memory (cycle 20).

Line 14. V2, V3, and V4 have been available for many cycles before this
instruction can issue. It has to wait for the use of the multiply
unit. Note also that the A-register multiplies do not interfere
with the floating-point multiplies since they are done in a
separate functional unit.

Lines 17 and 18.
These instructions chain.

Line 19. This is another move instruction. The release of V0 by this
instruction determines the length of the loop (205 cycles).

Line 20. This does not chain with the move at line 19. It issues at cycle
141 because it can't get at the multiply unit from line 14 before
then.

Line 24. This is the final store instruction. It is released for issue by
the availability of the memory from line 17. The memory functional
unit also determines the time for the loop since we are using it
for 68 + 68 + 69 = 205 cycles.
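The memory-bound loop time can be checked with a few lines of arithmetic. The sketch below uses the memory reservation rule from Appendix A (vector length plus 4 clock periods for a read, plus 5 for a write); the function and variable names are chosen only for this illustration.

```python
def memory_cycles(vl, is_store):
    """Clock periods the memory is reserved by one vector load or store."""
    return vl + (5 if is_store else 4)

VL = 64
# Two operand loads and one result store per pass through the loop.
loop_cycles = 2 * memory_cycles(VL, False) + memory_cycles(VL, True)
cycles_per_element = loop_cycles / VL   # cost per result in the main loop
cft_ratio = 443 / loop_cycles           # CFT loop time relative to this loop
print(loop_cycles, cycles_per_element, cft_ratio)
```

Because memory is busy for the whole of every pass, no rearrangement of the arithmetic instructions can shorten the loop further.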

The following is a timing and accuracy test for QVDIV0:

* RCFT I=TEPTQVD,ON=G,B=BQVD,C=C00
* LDR I=(BQVDIV,BQVD),X=XQVD,ORDER=CLNB,FIRST=BQVDIV
* XQVD

      COMMON /QVCOM/ X(48000),W(48000),U(48000),Z(48000)
      CALL LINK('UNIT59=(TTY,TEST)//')
      DO 3 L = 1, 12000, 64
C     FILL THE OPERAND ARRAYS
      DO 2 I=1, 3*L+1
      Z(I) = 4+L
    2 U(I) = 4+I
C     TIME QVDIV0 (RESULT STRIDE 4, OPERAND STRIDES 3 AND 2)
      K = IRTC(0)
      CALL QVDIV0(W(1),U(1),Z(2),L,4,3,2)
      N = IRTC(0)
      N = N-K
C     TIME THE EQUIVALENT FORTRAN LOOP
      K = IRTC(0)
      DO 1 I=0,L-1
      X(4*I+1) = U(3*I+1)/Z(2*I+2)
    1 CONTINUE
      M = IRTC(0)
      M = M-K
C     COMPARE THE TWO RESULTS ELEMENT BY ELEMENT
      DO 4 I=0,L-1
      IF(X(4*I+1).NE.W(4*I+1)) GO TO 5
    4 CONTINUE
      WRITE(59,60) L,M,N
   60 FORMAT (I6,2I6)
    3 CONTINUE
      STOP 1
    5 CONTINUE
      WRITE(59,59) W(4*I+1),X(4*I+1)
      WRITE(59,61) (W(I),I=1,4*L-3,4),(X(I),I=1,4*L-3,4)
   59 FORMAT (3M 6. 1 4)
   61 FORMAT (3O22)
      STOP
      END
QVSQRTH

We conclude our examples with the code for QVSQRTH, a half precise
(28-bit-accurate) square root routine for arrays, available in STACKLIB. The
full-precision routine QVSQRT is quite similar, requiring one additional
iteration but needing, also, a full precision divide during this final
iteration. The code is perhaps remarkable in that maximum speed is obtained
by breaking the array up into vectors of length 31, and because every vector
operation is chained to the previous one. A total of 21 consecutive chained
vector operations occur.

Essentially, the idea is to compute an initial guess X0 and then to
iterate three times by the formula: X(i+1) = (X(i) + Y/X(i))/2, where Y is the
number whose square root is desired. The iterative loop can be managed by
the four CAL instructions:

V0 /HV1
V2 V0*FV3 
V4 V2+FV1 
V5 S4+V4 
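The iteration and its four-instruction loop can be sketched in Python on a single value. This is an illustration, not the STACKLIB code: the starting value `initial_guess` is a hypothetical linear fit chosen only for the example (the text does not say how X0 is formed), and ordinary double-precision division stands in for the half-precision reciprocal.

```python
import math

def initial_guess(y):
    """Hypothetical starting value: a linear fit to sqrt on the mantissa."""
    m, e = math.frexp(y)            # y = m * 2**e with 0.5 <= m < 1
    if e % 2:                       # make the exponent even
        m, e = m * 2.0, e - 1
    return math.ldexp((1.0 + m) / 2.0, e // 2)

def half_precise_sqrt(y, iterations=3):
    x = initial_guess(y)
    for _ in range(iterations):
        reciprocal = 1.0 / x            # V0 /HV1   : 1/X
        quotient = reciprocal * y       # V2 V0*FV3 : Y/X
        total = quotient + x            # V4 V2+FV1 : X + Y/X
        x = math.ldexp(total, -1)       # V5 S4+V4  : exponent minus one halves
    return x
```

Each pass roughly doubles the number of correct bits, so the accuracy reached after three passes depends directly on the quality of the starting value.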

The halving operation is performed by adding minus one to the exponent.
Chaining will end for long vectors at the (+F) instruction since there will
be a conflict over the use of register V1. However, by adding one auxiliary
NO-OP instruction (a shift of zero), we can achieve the following timing for
vectors of length 31, since the +F is delayed until V1 is free.

                   I    C         F    R

    V0  /HV1       0   16   31   35   47
    V6  V0*FV3    16   25   47   51   56
    V2  V6>A7     25   31   56   60   62
    V4  V2+FV1    31   39   62   66   70
    V5  S4+V4     39   43   70   74   75

Now, at cycle 43, we can issue another reciprocal operation (to register
V7) and continue the procedure without any breaks in the chain. Moreover,
since the initial guess can be generated by a similar set of chained
operations, the entire calculation may proceed from the initial load, with
each successive vector instruction issuing at the chain cycle of the previous
one. (In the full-precision routine, the chain is broken during the
calculation of the full-precision reciprocal.)
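The columns of such a chart follow a regular pattern that can be reproduced mechanically. In the sketch below the relations (chain slot at issue plus functional-unit time plus 2, operand registers free at issue plus VL, functional unit free at issue plus VL plus 4, result register free at chain slot plus VL) are inferred from the chart values, and the functional-unit times in `FU_TIME` are assumed values rather than figures quoted in this section.

```python
# Functional-unit times in clock periods (assumed values).
FU_TIME = {"recip": 14, "fmul": 7, "shift": 4, "fadd": 6}

def timing_row(issue, unit, vl):
    """Return (issue, chain, operands free, unit free, result free)."""
    chain = issue + FU_TIME[unit] + 2
    operands_free = issue + vl
    unit_free = issue + vl + 4
    result_free = chain + vl
    return (issue, chain, operands_free, unit_free, result_free)

def chained_sequence(units, vl, start=0):
    """Each instruction issues at the chain slot of the one before it."""
    rows = []
    issue = start
    for unit in units:
        row = timing_row(issue, unit, vl)
        rows.append(row)
        issue = row[1]
    return rows

for row in chained_sequence(["recip", "fmul", "shift", "fadd"], 31):
    print(row)
```

With a vector length of 31 this reproduces the first four rows of the chart, including the issue of the floating add at cycle 31, the cycle at which V1 is released by the reciprocal.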

The timing chart for this half-precise square root is given next (for
the main loop). A full iteration begins at label ITER. The complete routine
is available in file CLASS.

Address   Instruction        I     C           F     R

LOOP      V5   V0*FV1        0     9    31    35    40
          V6   V5>A7         9    15    40    44    46
          A7   24           10    11
          V2   V6+FV7       15    23    46    50    54
          A0   A1+A7        17    19
          A7   A7-A6        18    20
          V3   S5+V2        23    28    54    58    59
          S0   +A7          24    26
          A7   -A7          25    27
          S7   VM           26    27
          V4   /HV3         28    44    59    63    75
          JSP  NOLOD        29    34
          VL   A7           31    32
          V0   ,A0,A3       32    41    39    43    48
          VM   V0,Z         41    52    48    52    54
          VL   A6           42    43
NOLOD     V5   V4*FV1       44    53    75    79    84
          A7                45    46
          V0   V5>A7        53    59    84    88    90
          S7   S6&S7        54    55
          A7   A6*A3        55    61
          S2   VM           56    57
          V6   V0+FV3       59    67    90    94    98
          S2   S2>24        60    62
          S7   S2!S7        62    63
          VM   S7           63    66
          A1   A1+A7        64    66
          V2   S4!V6&VM     67    71    98   102   102
          A0   A6-A5        68    70
          A7   A6*A4        69    75
          A5   A5-A6        70    72
          V7   S5+V2        71    76   102   106   107
          JAP  DUN          72    77
          A0   A6-A5        74    76
          JAP  SHORT2       78    83
ITER      A1   A1           80    82
          A0   A1           82    84
          V1   ,A0,A3       84    93   115   119   124
RTN2      V0   S1*FV1       93   102   124   128   133
          S2   >2           94    95
          S2   S2>15        95    97
          V2   S2!V0       102   106   133   137   137
          V3   V2>A0       106   112   137   141   143
          A0   A2          107   109
          A2   A2+A7       108   110
          V4   S3+V3       112   117   143   147   148
          V5   /HV4        117   133   148   152   164
          ,A0,A4  V7       118     -   149   154     -
          V6   V5*FV1      133   142   164   168   173
          A7   24          134   135
          VL   A7          135   136
          VM   V0,Z        137   165   161   165   167
          VL   A6          138   139
          A7               140   141
          V2   V6>A7       142   148   173   177   179
          V3   V2+FV4      148   156   179   183   187
          V7   S5+V3       156   161   187   191   192
          V0   /HV7        161   177   192   196   208
          J    LOOP        162   167









APPENDIX A. AN ABRIDGEMENT OF THE SUMMARY OF CPU TIMING INFORMATION
FURNISHED BY CRAY RESEARCH INC.

When issue conditions are satisfied, an instruction completes in a fixed
amount of time. Instruction issue may cause reservations to be placed on a
functional unit or registers. Knowledge of the issue conditions, instruction
execution times and reservations permits accurate timing of code sequences.
Memory bank conflicts due to I/O activity are the only element of
unpredictability.

SCALAR INSTRUCTIONS

Four conditions must be satisfied for issue of a scalar instruction:

1. The functional unit must be free. No conflicts can arise with other
scalar instructions. However, vector floating point instructions reserve
the floating point units. Memory references may be delayed due to
conflicts.

2. The result register must be free.

3. The operand register must be free.

4. Issue is delayed 1 clock period if a result register group input path
conflict would exist with a previously issued instruction. One input
path exists for each of the four register groups (A, B, S and T).

Scalar instructions place reservations only on result registers. A
result register is reserved for the execution time of the instruction. No
reservations are placed on the functional unit or operand registers.

A transmit (VM) to Si (073) instruction is delayed by
(VL) + 6 clock periods from the issue of a previous vector mask (175)
instruction, and is delayed by 6 clock periods from the issue of a preceding
transmit (Sj) to VM (003) instruction.



Execution times in clock periods are given below. An asterisk indicates
that issue may be delayed because of a functional unit reservation by a
vector instruction. Memory may be considered a functional unit for timing
considerations.

(A=A-register, M=Memory, B=B-register, S=S-register, I=Immediate,
C=Channel, T=T-register, V=V-register, * see above)

24-bit results:

    A<--M           11*       A<--C             4
    M<--A            1*       A<--A+A           2
    A<--B            1        A<--AxA           6
    B<--A            1        A<--pop(S)        4
    A<--S            1        A<--lzc(S)        3
    A<--I            1        VL<--A            1

64-bit results:

    S<--M           11*       S<--S+S           3
    M<--S            1*       S<--S(f.add)S     6*
    S<--T            1        S<--S(f.mult)S    7*
    T<--S            1        S<--(r.a.)S      14*
    S<--I            1        S<--V             5
    S<--S(log)S      1        V<--S             1
    S<--S(shift)I    2        S<--VM            1
    S<--S(shift)     3        S<--RTC           1
    S<--S(mask)      1        S<--A             2
    RTC<--S          1        VM<--S            3
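As a small worked example of using these execution times, the sketch below schedules a chain of scalar instructions in which each one needs the result of the one before it, so that each issues when the previous result register is released. It is a simplification that ignores every other issue condition; the dictionary keys and function name are chosen for this illustration. The values used are those of the three dependent instructions S3 A3, S5 S5+S3, and A0 S5 in the QVDIV0 chart, the first of which issues at cycle 6.

```python
# Execution times in clock periods, taken from the table above.
EXEC_TIME = {"S<-A": 2, "S<-S+S": 3, "A<-S": 1}

def schedule_dependent_chain(ops, first_issue):
    """Return (op, issue, complete) for a chain of dependent scalar ops."""
    rows = []
    issue = first_issue
    for op in ops:
        complete = issue + EXEC_TIME[op]
        rows.append((op, issue, complete))
        issue = complete            # next op waits for this result register
    return rows

for row in schedule_dependent_chain(["S<-A", "S<-S+S", "A<-S"], 6):
    print(row)
```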

Vector Instructions

Four conditions must be satisfied for issue of a vector instruction:

1. The functional unit must be free. (Conflicts may occur with vector
operations.)

2. The result register must be free. (Conflicts may occur with vector
operations.)

3. The operand registers must be free or at chain slot time.

4. Memory must be quiet if the instruction references memory.

Vector instructions place reservations on functional units and registers
for the duration of execution.

1. Functional units are reserved for (VL)+4 clock periods. Memory is
reserved for (VL)+5 clock periods on a write operation, (VL)+4 clock
periods on a read operation.

3. Vector operand registers are reserved for (VL) clock periods. Vector
operand registers are reserved for 5 clock periods if the vector length
is less than 5. The vector register used in a block store to memory (177
instruction) is reserved for (VL) clock periods. Scalar operand
registers are not reserved.

Vector instructions produce one result per clock period. The functional
unit times are given below. The vector read and write instructions (176,
177) produce results more slowly if bank conflicts arise due to the increment
value (Ak) being a multiple of 8. Chaining cannot occur for the vector read
operation in this case.

If (Ak) is an odd multiple of 8(*), results are produced every 2 clock
periods. If (Ak) is an even multiple of 8(*), results are produced every 4
clock periods.
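The stride rule can be written as a short function, which is convenient when checking array strides before coding a loop. This is a sketch of the rule as stated; the function name and the `eight_bank` flag (which applies the footnote for the 8 bank phasing option) are assumptions of the example.

```python
def clocks_per_result(stride, eight_bank=False):
    """Clock periods per result for a vector read or write with this stride."""
    base = 4 if eight_bank else 8       # footnote: multiples of 4 with 8 banks
    if stride % base != 0:
        return 1                        # no bank conflict: one result per clock
    multiple = stride // base
    return 2 if multiple % 2 else 4     # odd multiple: 2, even multiple: 4

# Strides such as the 4, 3, and 2 used in the QVDIV0 test avoid the slowdown.
print([clocks_per_result(s) for s in (2, 3, 4, 8, 16, 24)])
```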

Memory must be quiet before issue of the B and T register block copy
instructions (034-037). Subsequent instructions may not issue for 14+(Ai)
clock periods if (Ai).NE.0 and 5 clock periods if (Ai)=0 when reading data to
the B and T registers (034,036). They may not issue for 6+(Ai) clock periods
when storing data (035,037).

The B and T register block read (034,036) instructions require that
there be no register reservation on the A and S registers, respectively,
before issue.

Branch instructions cannot issue until the A0 or S0 operand register has
been free for two clock periods. Fall-through in buffer requires two clock
periods. Branch-in-buffer requires five clock periods. When an "out of
buffer" condition occurs the execution time for a branch instruction is 14
clock periods. (18 clock periods for 8-bank phasing option.)

A two parcel instruction takes two clock periods to issue.

Instruction issue is delayed 2 clock periods when the next instruction
parcel is in a different instruction parcel buffer. Instruction issue is
delayed 12 clock periods if the next instruction parcel is not in an
instruction parcel buffer.



* Multiple of 4 for 8 bank phasing option.

HOLD MEMORY

A delay of 1, 2, or 3 CP will be added to a scalar memory read if a bank
conflict occurs with rank C, B, or A, respectively, of the memory access
network. A conflict occurs if the address is in the same bank as the address
in rank C, B, or A. Conflicts can occur only with scalar or I/O references.
The scalar instruction senses the conflict condition at issue time + 1 CP.
The scalar instruction address enters rank A of the memory access network at
issue time + 1 CP. The scalar instruction address enters rank B at issue + 2
CP. The scalar instruction address enters rank C at issue + 3 CP.

Scalar load instruction timing (no conflict):

    CP n       Issue, reserve register
    CP n+1     Address rank A, sense conflict
    CP n+2     Address rank B
    CP n+3     Address rank C

    CP n+10    Clear register reservation
    CP n+11    Complete and issue waiting instruction
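A minimal sketch of the scalar load timing follows. It assumes that the conflict delay simply adds to the no-conflict completion time of 11 clock periods, which is one reading of the rule above; the names are chosen for this illustration.

```python
# Extra clock periods added by a bank conflict with each rank.
CONFLICT_DELAY = {None: 0, "C": 1, "B": 2, "A": 3}

def scalar_load_complete(issue_cp, conflict_rank=None):
    """Clock period at which a scalar load issued at issue_cp completes."""
    return issue_cp + 11 + CONFLICT_DELAY[conflict_rank]

print(scalar_load_complete(0), scalar_load_complete(0, "A"))
```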
APPENDIX B. WHAT HAPPENS WHEN YOU RUN ON THE CRAY?

You type your LOGON at a terminal; then: (*)

1. The LOGON line goes to the COMBO checker, which verifies it, appends some
bits of information and sends it on.

2. The line next arrives at the TMDS concentrator, which notes that it is
destined for the CRAY and routes it to the A410.

3. The A410 performs the appropriate protocol and drops the line onto the
NSC bus.

4. The A130, which is attached to a CRAY channel, picks up the line from the
bus and sends it along the CRAY channel to an LTSS memory buffer.

5. LTSS, which is frequently polling all CRAY channels, notices the
activity, sees that this is a LOGON line, and verifies that you are an
authorised user.

6. LTSS then prepares an index of private and public disk files to which you
have access and associates it with your user number.

7. LTSS returns an appropriate acknowledgment of your LOGON and sends it on
the reverse route to your teletype.

The acknowledgment response and all subsequent message lines bypass the COMBO
checker. In fact, if the COMBO checker was down at initial LOGON time, the
LOGON line would go directly to the TMDS concentrator.



* For the MFE network, replace items 1 through 4 above by the following:

M1. The LOGON line goes via a modem and telephone lines to a VADIC modem
multiplexor, which sends it on (or it may go directly to step M2).

M2. A PDP-11 concentrator then notes that it is destined for the CRAY and
routes the line to a 7600 PPU (12). (In the future, another PDP-11 will be
used.)

M3. The PPU performs the necessary protocol and sends the lines to the
CRAY-7600 Adaptor.

M4. The adaptor, which is attached to a CRAY channel, picks up the line and
sends it along to a CTSS memory buffer.
Next, you type in an EXECUTE line, say, CLASS / 1 .7, which goes to CRAY
LTSS.

1. A search is made of your private file index to determine whether you have
a file by the name of CLASS.

2. If not, a search is made of your PUBLIC file index to see if it has a
file by that name.

3. If not, the message "NO FILE" is sent to your terminal.

4. When CLASS is found, your PRIORITY is checked (V/TL = .7), and if
necessary, changed to conform to the current limits, or, if your account
has no time left, changed to S (standby).

5. The job is then assigned to an appropriate loading queue and, when memory
space is available, a number of words equal to the load length of this
file is brought into memory.

6. When the file is in memory, LTSS performs a sequence of validity checks
on the minus words. If any check fails, an appropriate message is
returned to your terminal, and execution ceases.

7. If all seems well, the job is placed in an appropriate queue and
scheduled for CPU time.

8. When the proper time arrives, LTSS relinquishes control of the CRAY CPU
to your program by exchanging from MONITOR to JOB mode, putting the
contents of your minus words into the CRAY registers, and requesting the
16-word buffer load of instructions containing the instruction addressed
by your program counter to be fetched to an instruction buffer.

9. Finally, then, the first instruction will be performed and the program
counter advanced to the next instruction.

10. In general, your program continues in control of the CPU until it makes a
recognized error, gives control back to LTSS, or is interrupted by LTSS.
However, while it is in control of the CPU, LTSS may have on-going I/O
activity, which will share the use of memory with your program.



APPENDIX C. THE DETAILS OF INSTRUCTION FETCH TIMING 

All this detail is incorporated in the code CYCLES. 

There are essentially five registers to consider, a few flags and a few
time positions.

[Diagram: instruction parcels pass from the instruction buffers through
ILATCH, NIP, and CIP (with LIP) to execution.]






An instruction which issues at cycle x must have entered the CIP at
cycle x-1 or before, the NIP at cycle x-2 or before, and the ILATCH at x-3 or
before. Some time prior to cycle x-3, the instruction must have been located
in one of the four 64-parcel instruction buffers, and before that, it was in
memory.

In general, instructions coming from the instruction buffers are able to
reach the CIP at a rate of one per cycle; however, when the end of a buffer
is reached, delays are encountered in locating the next instruction to be
processed. Similarly, whenever branch instructions cause the orderly flow of
sequential instructions to be interrupted, delays are to be expected.

The chart below illustrates details of the flow of instruction
parcels in the CRAY-1. Registers involved in this flow are described in the
"Instruction Issue and Control" section of Chapter 3 of the CRAY Hardware
Reference Manual.

In general, the P register is incremented by one each time an
instruction is issued. If the instruction parcel corresponding to the new P
value in sequence is in the current instruction buffer, then that parcel goes
to the ILATCH register during the same cycle. If the parcel is not in the
current I-buffer, then the ILATCH INVALID flag is set.

If the required parcel is not in any I-buffer, then a memory instruction
fetch request (IFR) is issued. Normally, four instruction words (16 parcels)
including the required parcel will arrive in the next I-buffer eleven cycles
after the IFR. If memory is already busy then the IFR must wait. The other
twelve instruction words to fill the I-buffer will be requested in groups of
four during the next three cycles. The required parcel reaches ILATCH in the
same cycle it reaches the I-buffer. I-buffers are loaded in strict rotation
regardless of when the buffer was used last.

If the required parcel is already in a different I-buffer, then CHANGE
BUFFER is set and on the following cycle the current I-buffer designator is
switched. The correct parcel will reach ILATCH on the following cycle, two
cycles delayed. A jump within the current I-buffer takes as long as a jump
to a different I-buffer.

An instruction issues from the CIP (current instruction parcel)
register. The second parcel of a two-parcel instruction issues from the LIP
(lower instruction parcel) register. In the same cycle a new parcel moves
into CIP from the NIP (next instruction parcel) register unless blocked by
the TPS (two parcel split) flag. The TPS flag is set when ILATCH is invalid
and NIP contains the first parcel of a two parcel instruction. (17d)

In the same cycle that a parcel moves from NIP to CIP, a parcel moves
from ILATCH to NIP unless blocked by the ILATCH INVALID flag described above.
If NIP contained the first parcel of a two parcel instruction, then the
parcel in ILATCH goes to LIP instead, and a NOP is placed in NIP.

With these rules we are now ready to use the chart below which
illustrates the cycle-by-cycle progress of instruction parcels for the
following code sequence:

addr   parcel    CAL mnemonics

17a    072700    s7    rt
17b    020100    a1    two
17c    000002    * repeat 1
17d    031110    a1    a1-1
20a    030001    a0    a1
20b    011000    jan   *-2
20c    000077
20d    072600    s6    rt
21a    004000    ex
                 two = 2



Assume that completion of an exchange sequence results in setting the P
register to 17a in cycle 1. IFR means "instruction fetch request" issued
for these words. An x in the x column shows that nip entry is blocked
because of invalid data in ilatch. A dash (-) means irrelevant.



[Chart: for each of cycles 1 through 49, the IFR words, the words ready,
the P register, and the contents of ilatch, x, nip, (lip), and cip,
together with the instruction issued and comments.]

cycle   notes

  1     words 14-17 are requested from memory.
 12     words 14-17 reach I-buffer and parcel 17a enters ILATCH.
 15     parcel 17a issues fourteen cycles after being requested from memory.
 16     17b issues and parcel 20a (words 20-23) is requested from memory.
        In general, the next buffer is requested when 17b issues from the
        old buffer. If 20a is not in an I-buffer then it will be ready to
        issue after fourteen more cycles, unless further delayed by memory
        busy.
 30     parcel 20a issues fourteen cycles after 17b issued and IFR.
 32     register a0 = 0+a1 is ready. The result is sent to the A0 branch flag
        setting unit. This would not delay instructions other than jump
        on A0 instructions.
 33     the A0 branch flags are set.
 34     now the Jump on A0 Non-zero can issue which resets the P register.
        A jump to a parcel already in an I-buffer takes 5 cycles for the
        target parcel to issue.
 37     parcel 20a is requested when 17d leaves ILATCH. 20a is in an I-buffer
        and will be in ILATCH in two cycles.
 39     target parcel 17d issues and 20a reaches ILATCH.
 42     parcel 20a issues as in cycle 30.
 46     JAN issues but this time the P register is not reset and we drop
        through.
 48     the real-time clock reading would be 33 cycles greater than cycle 15.



CYCLES' output for this code sequence:

loc      instr           res   operand   w b delay    i     c

00017a   072700          s7    rt        A20000      15    16
00017b   0201 00000002   a1    two                   16    17    2=a1
00017d   031110          a1    a1-1                  18    20    1=a1
00020a   030001          a0    a1        11B00204    30    32    1=a0
00020b   011 00000017d   jan   17d       3 00100     34    39
         jump back to repeat at 17d
00017d   031110          a1    a1-1      a           39    41    0=a1
00020a   030001          a0    a1        2b00204     42    44    0=a0
00020b   011 00000017d   jan   17d       3 00100     46    51
00020d   072600          s6    rt                    48    49
00021a   004000          ex              1 02000     50   100




TTY input to CYCLES for this example:

cycles tty htty.
p17a c15 72700 20100 2 31110 30001 11000 77
p17a 31110 30001 11000 77 72600 4000 end

Summary

Instruction look ahead is effectively three parcels (CIP, NIP, and
ILATCH). When instruction 17b of a buffer is issued, the first parcel (20a)
of the next I-buffer load is sought. If parcel 20a is already in an I-buffer
then it is delayed only 2 cycles; if it is not in a buffer, then it should be
ready to issue fourteen cycles after it was requested (i.e. after 17b
issued). The request is delayed until memory is not busy. After the request
is accepted memory is busy for six additional cycles.
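The look-ahead rule can be sketched as a small function. This is a simplified reading that ignores the exceptional cases; the function and parameter names are hypothetical, and `normal_issue_cycle` stands for the cycle at which the parcel would have issued with no buffer change at all.

```python
def first_parcel_issue(request_cycle, normal_issue_cycle, in_buffer,
                       memory_free_cycle=0):
    """Cycle at which the first parcel of the next I-buffer can issue."""
    if in_buffer:
        # Already in another I-buffer: a change of buffer costs 2 cycles.
        return normal_issue_cycle + 2
    # Otherwise it must be fetched: the request waits for memory, and the
    # parcel is ready to issue fourteen cycles after the request is accepted.
    accepted = max(request_cycle, memory_free_cycle)
    return max(normal_issue_cycle, accepted + 14)

# Values from the example: 17b issues at cycle 16 and 20a follows at 30;
# on the second pass 17d issues at 39 and 20a, now in a buffer, at 42.
print(first_parcel_issue(16, 19, False), first_parcel_issue(37, 40, True))
```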

There are four exceptional cases to consider:

1. If 17c is a branch instruction, then the instruction fetch request (IFR)
is delayed until the jump address is decided. The address is decided in
the jump issue cycle except for "J Bjk" in which it is decided two cycles
later.

2. If 17c is a scalar load or store which issues immediately, then it gets
memory service first and the instruction fetch is delayed four cycles.

3. If 17c is a vector load or store or a block register transfer and it
issues immediately, then the instruction fetch is delayed until 17c is
done with memory. The delay will be VL+4 for a load and VL+5 for a
store.

4. If 17c is a one parcel instruction followed by a two parcel instruction,
then if 17c does not issue immediately, it will be held from issue until
the second parcel of 17d reaches ILATCH. The hold is caused by the
setting of the TPS (two parcel split) flag after 17d reaches NIP.

H 

The following sequences, which differ only by the second instruction ■ 

issued (at cycle 2 or 1), illustrates this effect: 1 
loc     instr          res       operand  w b delay    i   c
00017a  061106         s1        -s6      A                3
00017b  054521         s5        s5<17    1 00020      2   4
00017c  070210         s2        /hs1                  3  17
00017d  1305 00010000  10000b,   s5       1 1 00200   15
00020b  064432         s4        s3*fs2               17  24
00020c  1304 00010001  10001b,   s4       6 00004     24




loc     instr          res       operand  w b delay    i   c
00017a  061106         s1        -s6      A                3
00017b  042521         s5        <47                   1   2
00017c  070210         s2        /hs1     1 1 00204   13  27
00017d  1305 00010000  10000b,   s5       B           14
00020b  064432         s4        s3*fs2   1 1 00004   27  34
00020c  1304 00010001  10001b,   s4       6 00004     34
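
The cost of the hold can be read off the two listings with a short calculation. This is a minimal sketch, assuming that the columns headed i and c give the issue and completion cycles, and that the blank i entry for 00017a stands for cycle 0. Both readings are assumptions, and the names used are hypothetical.

```python
# Rows are (loc, issue_cycle, completion_cycle). None means the listing
# shows no completion cycle (the two stores).
# Assumption: the blank "i" entry for 00017a is read as cycle 0.
FIRST = [
    ("00017a", 0, 3),
    ("00017b", 2, 4),
    ("00017c", 3, 17),
    ("00017d", 15, None),
    ("00020b", 17, 24),
    ("00020c", 24, None),
]
SECOND = [
    ("00017a", 0, 3),
    ("00017b", 1, 2),
    ("00017c", 13, 27),
    ("00017d", 14, None),
    ("00020b", 27, 34),
    ("00020c", 34, None),
]


def last_issue(rows):
    """Issue cycle of the last instruction in a listing."""
    return rows[-1][1]


def busy_times(rows):
    """Cycles between issue and completion for rows that show both."""
    return {loc: done - issue for loc, issue, done in rows if done is not None}


if __name__ == "__main__":
    print(busy_times(FIRST))
    print(busy_times(SECOND))
    print(last_issue(SECOND) - last_issue(FIRST))
```

With these readings, the final store of the second listing issues at cycle 34 against cycle 24 in the first, even though its second instruction issued one cycle earlier. The whole difference comes from 17c issuing at cycle 13 instead of cycle 3.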





DISCLAIMER

This document was prepared as an account
of work sponsored by an agency of the
United States Government. Neither the
United States Government nor the
University of California nor any of
their employees, makes any warranty,
express or implied, or assumes any legal
liability or responsibility for the
accuracy, completeness or usefulness of
any information, apparatus, product or
process disclosed, or represents that
its use would not infringe privately
owned rights. Reference herein to any
specific commercial products, process,
or service by trade name, trademark,
manufacturer, or otherwise, does not
necessarily constitute or imply its
endorsement, recommendation, or favoring
by the United States Government or the
University of California. The views and
opinions of authors expressed herein do
not necessarily state or reflect those
of the United States Government thereof,
and shall not be used for advertising or
product endorsement purposes.



Work performed under the auspices of the
U.S. Department of Energy by Lawrence
Livermore National Laboratory under
contract number W-7405-Eng-48.






